Abstract


The objective of this work is to automatically recognize faces from video sequences in 
a realistic, unconstrained setup in which illumination conditions are extreme and greatly 
changing, viewpoint and user motion pattern have a wide variability, and video input is 
of low quality. At the centre of focus are face appearance manifolds: this thesis presents 
a significant advance of their understanding and application in the sphere of face recognition.
The two main contributions are the Generic Shape-Illumination Manifold recognition
algorithm and the Anisotropic Manifold Space clustering. 

The Generic Shape-Illumination Manifold algorithm shows how video sequences of unconstrained
head motion can be reliably compared in the presence of greatly changing imaging
conditions. Our approach consists of combining a priori domain-specific knowledge in
the form of a photometric model of image formation, with a statistical model of generic face 
appearance variation. One of the key ideas is the reillumination algorithm which takes two 
sequences of faces and produces a third, synthetic one, that contains the same poses as the 
first in the illumination of the second. 

The Anisotropic Manifold Space clustering algorithm is proposed to automatically determine
the cast of a feature-length film, without any dataset-specific training information.
The method is based on modelling coherence of dissimilarities between appearance manifolds:
it is shown how inter- and intra-personal similarities can be exploited by mapping
each manifold into a single point in the manifold space. This concept allows for a useful
interpretation of classical clustering approaches, which highlights their limitations. A superior
method is proposed that uses a mixture-based generative model to hierarchically grow
class boundaries corresponding to different individuals.

The Generic Shape-Illumination Manifold is evaluated on a large data corpus acquired 
in real-world conditions and its performance is shown to greatly exceed that of state-of-the-art
methods in the literature and the best performing commercial software. Empirical
evaluation of the Anisotropic Manifold Space clustering on a popular situation comedy is 
also described with excellent preliminary results. 
Introduction


This chapter sets up the main premises of the thesis. We start by defining the problem of 
automatic face recognition and explain why this is an extremely challenging task, followed by 
an overview of its practical significance. An argument for the advantages of face recognition, 
in the context of other biometrics, is presented. Finally, the main assumptions and claims 
of the thesis are stated, followed by a synopsis of the remaining chapters. 


1.1 Problem statement 

This thesis addresses the problem of automatic face recognition from video in realistic imaging
conditions. While in some previous work the term “face recognition” has been used for
any face-based biometric identification, we will operationally define face recognition as
classification of persons by their identity using images acquired in the visible electromagnetic
spectrum.

Humans recognize faces seemingly effortlessly on a daily basis, yet the same task has so
far proved to be of formidable difficulty to automatic methods [Bri04, Bos02, Cha05, Zha04].
A number of factors other than one’s identity influence the way an imaged face appears.
Lighting conditions, and especially light angle, can drastically change the appearance of
a face. Facial expressions, including closed or partially closed eyes, also complicate the
problem, just as head pose and scale changes do. Background clutter and partial occlusions,
be they artifacts in front of the face (such as glasses) or the result of a change of hair style
or of growing a beard or a moustache, all pose problems to automatic methods. Invariance to the
aforementioned factors is a major research challenge (see Figure 1.1).


1.2 Applications 

The most popularized use of automatic face recognition is in a broad range of security 
applications. These can be typically categorized under either (i) voluntary authentication, 
such as for the purpose of accessing a building or a computer system, or for passport control, 
or (ii) surveillance, for example for identifying known criminals at airports or offenders from 
CCTV footage, see Figure 1.2. 

In addition to its security uses, the rapid technological development we are experiencing
has created a range of novel promising applications for face recognition. Mobile devices,
such as PDAs and mobile phones with cameras, together with freely available software
for video-conferencing over the Internet, are examples of technologies that manufacturers
are trying to make “aware” of their environment for the purpose of easier, more intuitive
interaction with the human user. Cheap and readily available imaging devices, such as
cameras and camcorders, and storage equipment (such as DVDs, flash memory and HDDs)
have also created a problem in organizing large volumes of visual data. Given that humans
(and faces) are often at the centre of interest in images and videos, face recognition can
be employed for content-based retrieval and organization, or even synthesis of imagery, see
Figure 1.3.

Figure 1.1: The effects of imaging conditions - illumination (a), pose (b) and expression
(c) - on the appearance of a face are dramatic and present the main difficulty to automatic
face recognition methods.

Figure 1.2: The two most common paradigms for security applications of automatic face
recognition are (a) authentication and (b) surveillance. It is important to note the drastically
different data acquisition conditions. In the authentication setup, the imaging conditions are
typically milder, more control over the illumination setup can be exercised and the user can
be asked to cooperate to a degree, for example by performing head motion. In a surveillance
environment, the viewpoint and illumination are largely uncontrolled and often extreme, face
scale can have a large range and image quality is poor.

Figure 1.3: Data organization applications of face recognition are rapidly gaining in importance.
Automatic organization and retrieval of (a) amateur photographs or (b) video
collections are some of the examples. Face recognition is in both cases extremely difficult,
with large and uncontrolled variations in pose, illumination and expression, further complicated
by image clutter and frequent partial occlusions of faces.

The increasing commercial interest in face recognition technology is well witnessed by 
the trend of the relevant market revenues, as shown in Figure 1.4. 

1.3 A case for face recognition from video 

Over the years, a great number of biometrics have been proposed and found their use for the 
task of human identification. Examples are fingerprint [Jai97, Mal03], iris [Dau92, Wil94] 
or retinal scan-based methods. Some of these have achieved impressively high identification 
rates [Gra03] (e.g. retinal scan ~ 10“^ error rate [Nat]). 

However, face recognition has a few distinct advantages. In many cases, face information
may be the only cue available, such as in the increasingly important content-based multimedia
retrieval applications [Ara06a, Ara05c, Ara06i, Le06, Siv05]. In others, such as in some
surveillance environments, bad imaging conditions render any single cue insufficient and a
fusion of several may be needed (e.g. see [Bru95a, Bru95b, Sin04]). Even for access-control
applications, when more reliable cues are available [Nat], face recognition has the attractive
property of being very intuitive to humans as well as non-invasive, making it readily
acceptable by wider audiences. Finally, face recognition does not require user cooperation.
Video. The nature of many practical applications is such that more than a single image
of a face is available. In surveillance, for example, the face can be tracked to provide a 
temporal sequence of a moving face. For access-control use of face recognition the user may 
be assumed to be cooperative and hence be instructed to move in front of a fixed camera. 
This is important as a number of technical advantages of using video exist: person-specific 
dynamics can be learnt, or more effective face representations be obtained (e.g. super-resolution
images or a 3D face model) than in the single-shot recognition setup. Regardless
of the way in which multiple images of a face are acquired, this abundance of information can 
be used to achieve greater robustness of face recognition by resolving some of the inherent 
ambiguities (shape, texture, illumination etc.) of single-shot recognition. 


1.4 Basic premises and synopsis 

The first major premise of work in this thesis is: 

Premise 1 Neither purely discriminative nor purely generative approaches are very successful
for face recognition in realistic imaging conditions.

Hence, the development of an algorithm that in a principled manner combines (i) a generative
model of well-understood stages of image formation with (ii) data-driven machine learning
is one of our aims. Secondly:
Premise 2 Face appearance manifolds provide a powerful way of representing video sequences
(or sets) of faces and allow for a unified treatment of illumination, pose and
face motion pattern changes.

Thus, the structure of this work is as follows. In Chapter 2 we review the existing literature
on face recognition and highlight the limitations of state-of-the-art methods, motivating
the aforementioned premises. Chapter 3 introduces the notion of appearance manifolds and
proposes a solution to the simplest formulation of the recognition problem addressed in
this thesis. The subsequent chapters build on this work, relaxing assumptions about the
data from which recognition is performed, culminating with the Generic Shape-Illumination
method in Chapter 9. The two chapters that follow apply the introduced concepts to the
problem of face-driven content-based video retrieval and propose a novel framework for making
further use of the available data. A summary of the thesis and its major conclusions are
presented in Chapter 12.

1.4.1 List of publications 

The following publications have directly resulted from the work described in this thesis: 

Journal publications 

1. O. Arandjelovic and R. Cipolla. An information-theoretic approach to face recognition
from face motion manifolds. Image and Vision Computing (special issue
on Face Processing in Video Sequences), 24(6):639-647, 2006.

2. M. Johnson, G. Brostow, J. Shotton, O. Arandjelovic and R. Cipolla. Semantic
photo synthesis. Computer Graphics Forum, 25(3):407-413, 2006.

3. O. Arandjelovic and R. Cipolla. Incremental learning of temporally-coherent 
Gaussian mixture models. Society of Manufacturing Engineers (SME) Technical 
Papers, 2, 2006. 

4. T-K. Kim, O. Arandjelovic and R. Cipolla. Boosted manifold principal angles for
image set-based recognition. Pattern Recognition, 40(9):2475-2484, 2007.

5. O. Arandjelovic and R. Cipolla. A pose-wise linear illumination manifold model 
for face recognition using video. Computer Vision and Image Understanding, 
113(1):113-125, 2009. 

6. O. Arandjelovic and R. Cipolla. A methodology for rapid illumination-invariant 
face recognition using image processing filters. Computer Vision and Image Understanding,
113(2):159-171, 2009.




7. O. Arandjelovic, R. Hammoud and R. Cipolla. Thermal and reflectance based 
personal identification methodology in challenging variable illuminations. Pattern
Recognition, 43(5):1801-1813, 2010.

8. O. Arandjelovic and R. Cipolla. Achieving robust face recognition from video by 
combining a weak photometric model and a learnt generic face invariant. Pattern 
Recognition, 46(1):9-23, January 2013.

Conference proceedings 

1. O. Arandjelovic and R. Cipolla. Face recognition from face motion manifolds 
using robust kernel resistor-average distance. In Proc. IEEE Workshop on Face
Processing in Video, 5:88, 2004.

2. O. Arandjelovic and R. Cipolla. An illumination invariant face recognition system 
for access control using video. In Proc. British Machine Vision Conference, pages 
537-546, 2004. 

3. O. Arandjelovic, G. Shakhnarovich, J. Fisher, R. Cipolla, and T. Darrell. Face 
recognition with image sets using manifold density divergence. In Proc. IEEE 
Conference on Computer Vision and Pattern Recognition, 1:581-588, 2005. 

4. O. Arandjelovic and A. Zisserman. Automatic face recognition for film character 
retrieval in feature-length films. In Proc. IEEE Conference on Computer Vision 

5. T-K. Kim, O. Arandjelovic and R. Cipolla. Learning over sets using boosted

6. O. Arandjelovic and R. Cipolla. Incremental learning of temporally-coherent

7. O. Arandjelovic and R. Cipolla. A new look at filtering techniques for illumination
invariance in automatic face recognition. In Proc. IEEE Conference on
Automatic Face and Gesture Recognition, pages 449-454, 2006.

8. O. Arandjelovic and R. Cipolla. Face recognition from video using the generic 
shape-illumination manifold. In Proc. European Conference on Computer Vision, 

9. O. Arandjelovic and R. Cipolla. Automatic cast listing in feature-length films
with anisotropic manifold space. In Proc. IEEE Conference on Computer Vision and
Pattern Recognition, 2:1513-1520, 2006.


Figure 2.1: The main stages of an automatic face recognition system. (i) A face detector is
used to detect (localize in space and scale) faces in a cluttered scene; this is followed by (ii)
extraction of features used to represent faces and, finally, (iii) extracted features are compared
to those stored in a training database to associate novel faces to known individuals.


Important practical applications of automatic face recognition have made it a very popular
research area of computer vision. This is evidenced by a vast number of face recognition
algorithms developed over the last three decades and, in recent years, the emergence of a
number of commercial face recognition systems. This chapter: (i) presents an account of the
face detection and recognition literature, highlighting the limitations of the state-of-the-art,
(ii) explains the performance measures used to gauge the effectiveness of the proposed algorithms,
and (iii) describes the data sets on which algorithms in this thesis were evaluated.


2.1 Introduction 

At the coarsest level, the task of automatic, or computer-based, face recognition inherently 
involves three stages: (i) detection/localization of faces in images, (ii) feature extraction and 
(iii) actual recognition, as shown in Figure 2.1. The focus of this thesis, and consequently 
the literature review, is on the last of the three tasks. However, it is important to understand 
the difficulties of the two preceding steps. Any inaccuracy injected in these stages impacts 
the data a system must deal with in the recognition stage. Additionally, the usefulness of 
the overall system in practice ultimately depends on the performance of the entire cascade. 
For this reason we first turn our attention to the face detection literature and review some 
of the most influential approaches. 


2.2 Face detection 

Unlike recognition which concerns itself with discrimination of objects in a category, the task 
of detection is that of discerning the entire category. Specifically, face detection deals with 
Figure 2.2: An input image and the result using the popular Viola-Jones face detector
[Vio04]. Detected faces are shown with green square bounding boxes, showing their location
and scale. One missed detection (i.e. a false negative) is shown with a red bounding
box. There are no false detections (i.e. false positives).

Table 2.1: An overview of face detection approaches.

Face detection approaches

Input data        Still images | Sequences
Approach          Ensemble | Cascade
Cues              Greyscale | Colour | Motion | Depth | Other
Representation    Holistic | Feature-based | Hybrid
Search            Greedy | Exhaustive | Focus of attention


the problem of determining the presence of and localizing faces in images. Much like face 
recognition, this is complicated by in-plane rotation of faces, occlusion, expression changes, 
pose (out-of-plane rotation) and illumination, see Figure 2.2. 

Face detection technology is fairly mature and a number of reliable face detectors have
been built. Here we summarize state-of-the-art approaches; for a higher level of detail the
reader may find it useful to refer to a general purpose review [Hje01, Yan02c].

State-of-the-art methods. Most of the current state-of-the-art face detection methods
are holistic in nature, as opposed to part-based. While part-based (also known as constellation
of features) approaches were proposed for their advantage of exhibiting greater viewpoint
robustness [Hei00], they have largely been abandoned for complex, cluttered scenes in favour
of multiple view-specific detectors that treat the face as a whole. Henceforth this should be
assumed when talking about holistic methods, unless otherwise stated. One such successful
algorithm was proposed by Rowley et al. [Row98]. An input image is scanned at multiple
Figure 2.3: A diagram of the method of Rowley et al. [Row98]. This approach is representative
of the group of neural network-based face detection approaches: (i) input image is
scanned at different scales and locations, (ii) features are extracted from the current window
and are (iii) fed into an MLP classifier.


scales with a neural network classifier which is fed filtered appearance of the current patch,
see Figure 2.3. Sung and Poggio [Sun98] also employ a Multi-Layer Perceptron (MLP),
but in a statistical framework, learning “face” and “non-face” appearances as Gaussian
mixtures embedded in the 361-dimensional image space (19 × 19 pixels). Classification is
then performed based on the “difference” vector between the appearance of the current
patch and the learnt statistical models. Much like in [Row98], an exhaustive search over
the location/scale space is performed. The method of Schneiderman and Kanade [Sch00]
moves away from greyscale appearance, proposing to use histograms of wavelet coefficients
instead. An extension to video sequence-based detection was proposed by Mikolajczyk et
al. in [Mik01]: a dramatic reduction of the search space between consecutive frames was
achieved by propagating temporal information using the Condensation tracking algorithm
[Isa98].
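
The exhaustive search over the location/scale space used by these early detectors can be illustrated with a minimal sketch. It is not the implementation of any of the cited methods: the function names, the stride, the scale step and the `is_face` classifier callback are all assumptions made for illustration, with only the 19-pixel base window taken from the text.

```python
def sliding_windows(image_width, image_height, window=19, scale_step=1.25, stride=2):
    """Yield (x, y, size) for every square window visited by an exhaustive scan.

    window, scale_step and stride are illustrative defaults (assumptions).
    """
    size = window
    while size <= min(image_width, image_height):
        # Scale the stride with the window so coverage stays comparable across scales.
        step = max(1, int(round(stride * size / window)))
        for y in range(0, image_height - size + 1, step):
            for x in range(0, image_width - size + 1, step):
                yield x, y, size
        size = int(round(size * scale_step))


def detect(image_width, image_height, is_face):
    """Return all windows accepted by the (hypothetical) classifier `is_face`."""
    return [w for w in sliding_windows(image_width, image_height) if is_face(*w)]
```

Even for a small image the number of candidate windows is large, and the classifier is evaluated on every one of them, which is the source of the computational cost of exhaustive search.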

While achieving good accuracy (see Table 2.2), the described early approaches suffer
from tremendous computational overhead. More recent methods have focused on online
detector efficiency, with attention-based rather than exhaustive search over an entire image.
The key observation is that the number of face patches in a typical image is usually much
smaller than the number of non-faces. A hierarchical search that quickly eliminates many
unpromising candidates was proposed in [Fer01]. The simplest and fastest filters are applied first,
greatly reducing the workload of the subsequent, gradually slower and more complex classifiers.
The same principle using Support Vector Machines was employed by Romdhani et al.
[Rom03a]. In [Fer01] Feraud et al. used a variety of cues including colour and motion-based
filters. A cascaded approach was also employed in the breakthrough method of Viola and
Jones [Vio04]. This detector, including a number of extensions proposed thereafter, is the
fastest one currently available (the authors report a speedup of a factor of 15 over [Row98]).
This is achieved by several means: (i) the attention cascade mentioned previously reduces
Table 2.2: A performance summary of state-of-the-art face detection algorithms. Shown is
the % of detected faces (i.e. true positives), followed by the number of incorrect detections (i.e.
false positives).

                                  Data set
Method                            CMU                        MIT
                                  (130 images, 507 faces)    (23 images, 57 faces)

Feraud et al. [Fer01]             86.0% / 8                  -
Garcia-Delakis [Gar04]            90.3% / 8                  90.1% / 7
Li et al. [Li02]                  90.2% / 31                 -
Rowley et al. [Row98]             86.2% / 23                 84.5% / 8
Sung-Poggio [Sun98]               -                          79.9% / 5
Schneiderman-Kanade [Sch00]       90.5% / 33                 91.1% / 12
Viola-Jones [Vio04]               85.2% / 5                  77.8% / 31


the number of computationally heavy operations, (ii) it is based on boosting fast weak 
learners [Fre95], and (iii) the proposed integral image representation eliminates repeated 
computation of Haar feature-responses. Improvements to the original detector have since 
been proposed, e.g. by using a pyramidal structure [Hua05, Li02] for multi-view detection
and rotation invariance [Hua05, Wu04], or joint low-level features [Mit05] for reducing the 
number of false positive detections. 
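
To make point (iii) concrete, the following is a minimal sketch of an integral image and of a two-rectangle Haar-like feature evaluated from it. It illustrates the general idea only and is not the code of [Vio04]; the function names and the particular left-minus-right feature are assumptions chosen for illustration.

```python
def integral_image(img):
    """Return ii with ii[y][x] = sum of img over all rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle with top-left corner (x, y), using four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Illustrative Haar-like feature: left half of the rectangle minus right half."""
    half = w // 2
    return box_sum(ii, x, y, half, h) - box_sum(ii, x + half, y, half, h)
```

Once the table is built in a single pass over the image, any rectangle sum costs four lookups regardless of its size, so the same feature can be evaluated at every location and scale without re-summing pixels.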


2.3 Face recognition 

There are many criteria by which one may decide to cluster face recognition algorithms,
depending on the focus of discussion, see Table 2.3. For us it will be useful to start by
talking about the type of data available as input and the conditions in which such data was
acquired. As was mentioned in the previous chapter, the main sources of changes in one’s
appearance are the illumination, head pose, image quality, facial expressions and occlusions.
In controlled imaging conditions, some or all of these variables are fixed so as to simplify
recognition, and this is known a priori. This is a possible scenario in the acquisition of
passport photographs, for example.

Historically, the first attempts at automatic face recognition date back to the early 1970s 
and were able to cope with some success with this problem setup only. These pioneering 
Table 2.3: An overview of face recognition approaches.

Face recognition approaches

Acquisition conditions    Controlled | “Loosely” controlled | Extreme
Input data                Still images | Image sets | Sequences
Modality                  Optical data | Other (IR, range etc.) | Hybrid
Representation            Holistic | Feature-based | Hybrid
Approach                  Appearance-based | Model-based


methods relied on predefined geometric features for recognition [Kel70, Kan73]. Distances 
[Bru93] or angles between locations of characteristic facial features (such as the eyes, the 
nose etc.) were used to discriminate between individuals, typically using a Bayes classifier 
[Bru93]. In [Kan73] Kanade reported correct identification of only 15 out of 20 individuals
under controlled pose. Later, Goldstein et al. [Gol72] and Kaya and Kobayashi [Kay72]
(also see work by Bledsoe et al. [Ble66, Cha65]) showed geometric features to be sufficiently
discriminatory if facial features are manually selected.

Most obviously, geometric feature-based methods are inherently very sensitive to head 
pose variation, or equivalently, the camera angle. Additionally, these methods also suffer 
from sensitivity to noise in the stage of localization of facial features. While geometric 
features themselves are insensitive to illumination changes, the difficulty of their extraction 
is especially prominent in extreme lighting conditions and when the image resolution is low. 

2.3.1 Statistical, appearance-based methods 

In sharp contrast to the geometric, feature-based algorithms are appearance-based methods.
These revived research interest in face recognition in the early 1990s and to this day are
dominant in number in the literature. Appearance-based methods, as their name suggests,
perform recognition directly from the way faces appear in images, interpreting them as ordered
collections of pixels. Faces are typically represented as vectors in the D-dimensional
image space, where D is the number of image pixels (and hence usually large). Discrimination
between individuals is then performed by employing statistical models that explain
inter- and/or intra-personal appearance changes.

The Eigenfaces algorithm of Turk and Pentland [Tur91a, Tur91b] is the most famous 
algorithm of this group. Its approach was motivated by previous work by Kohonen [Koh77] 
on auto-associative memory matrices for storage and retrieval of face images, and by Kirby 
Figure 2.4: (a) A conceptual picture of the Eigenfaces method. All data is projected and then
classified in the linear subspace corresponding to the dominant modes of variation across
the entire data set, estimated using PCA. (b) The first 10 dominant modes of the CamFace
data, shown as images. These are easily interpreted as corresponding to the main
modes of illumination-affected appearance changes (brighter/darker face, strong light from
the left/right etc.) and pose.


and Sirovich [Sir87, Kir90]. It uses Principal Component Analysis (PCA) to construct the
so-called face space - a space of dimension d ≪ D that explains appearance variations of
human faces, see Figure 2.4. Recognition is performed by projecting all data onto the face
space and classifying a novel face to the closest class. The most common norms used in
the literature are the Euclidean (also known as L2), L1 and Mahalanobis [Dra03, Bev01],
in part dictated by the availability of training data.
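
The projection-and-nearest-class scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the procedure of [Tur91a, Tur91b]: faces are assumed to be already cropped, registered and vectorized into rows of a matrix, the function names are hypothetical, and only the Euclidean (L2) norm is shown.

```python
import numpy as np


def fit_face_space(X, d):
    """X: (n_samples, D) matrix of vectorized faces. Returns the mean and a (D, d) basis."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centred data (PCA via SVD).
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:d].T


def project(X, mean, basis):
    """Coordinates of one face (D,) or many faces (n, D) in the d-dimensional face space."""
    return (X - mean) @ basis


def classify(x, mean, basis, gallery, labels):
    """Label of the gallery face closest to x in face space under the L2 norm."""
    diffs = project(gallery, mean, basis) - project(x, mean, basis)
    return labels[int(np.argmin(np.linalg.norm(diffs, axis=1)))]
```

Swapping the L2 norm for L1 or Mahalanobis only changes the distance computed in `classify`; the Mahalanobis variant additionally needs the per-component variances, which is one reason the choice depends on how much training data is available.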

By learning what appearance variations one can expect across the corpus of all human 
faces and then by effectively reconstructing any novel data using the underlying subspace 
model, the main advantage of Eigenfaces is that of suppressing noise in data [Wan05a]. By
interlacing the subspace projection with RANSAC [Fis81], occlusion detection and removal 
are also possible [Bla98, Ara05c]. However, inter- and intra-personal appearance variations 
are not learnt separately. Hence, the method is recognized as more suitable for detection 
and compression tasks [Mog95, Kin97] than recognition. 
A Bayesian extension of Eigenfaces, proposed by Moghaddam et al. [Mog98], improves
on the original method by learning the mean intra-personal subspace. The recognition decision
is again cast using the assumption that appearance for each person follows a Gaussian
distribution, with image noise that is also Gaussian, but isotropic.

To address the lack of discriminative power of Eigenfaces, another appearance-based
subspace method was proposed: the Fisherfaces [Yam00, Zha00], named after Fisher’s
Linear Discriminant Analysis (LDA) that it employs. Under the assumption of isotropically
Gaussian class data, LDA constructs the optimally discriminating subspace in terms
of maximizing the ratio of between-class to within-class scatter, as shown in Figure 2.5. Given
sufficient training data, Fisherfaces typically perform better than Eigenfaces [Yam00, Wen93],
with further invariance to lighting conditions when applied to Fourier transformed images
[Aka91]. One of the weaknesses of Fisherfaces is that the estimate of the optimal projection
subspace is sensitive to a particular choice of training images [Bev01]. This finding is important
as it highlights the need for more extensive use of domain specific information. It is
therefore not surprising that limited improvements were achieved by applying other purely
statistical techniques on the face recognition task: Independent Component Analysis (ICA)
[Bae02, Dra03, Bar98c, Bar02]; Singular Value Decomposition (SVD) [Pre92]; Canonical
Correlation Analysis (CCA) [Sun05]; Non-negative Matrix Factorization (NNMF) [Wan05].
Simple linear extrapolation techniques, such as the Nearest Feature Line (NFL) [Li99], Nearest
Feature Plane (NFP) or Nearest Feature Space (NFS) also failed to achieve a significant
performance increase using holistic appearance.
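
For reference, the scatter-ratio criterion that LDA maximizes is commonly written as follows. The notation is an assumption introduced here rather than taken from the thesis: C classes, with X_c the set of N_c training faces of class c, \mu_c the class mean, \mu the overall mean and W the projection matrix.

```latex
S_B = \sum_{c=1}^{C} N_c \,(\mu_c - \mu)(\mu_c - \mu)^{T},
\qquad
S_W = \sum_{c=1}^{C} \sum_{x \in X_c} (x - \mu_c)(x - \mu_c)^{T},
\qquad
W^{*} = \arg\max_{W} \frac{\left| W^{T} S_B W \right|}{\left| W^{T} S_W W \right|}
```

Because S_W is estimated from the training images of each class, a small or unrepresentative training set changes S_W, and hence W^{*}, considerably, which is consistent with the sensitivity to the choice of training images noted above.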

Nonlinear approaches. Promising results of the early subspace methods and the new research
challenges they uncovered motivated a large number of related approaches. Some
of these focused on relaxing the crude assumption that the appearance of a face conforms
to a Gaussian distribution. The most popular direction employed the kernel approach
[Sch99, Sch02] with methods such as Kernel Eigenfaces [Yan00, Yan02b], Kernel Fisherfaces
[Yan00], Kernel Principal Angles [Wol03], Kernel RAD [Ara04b, Ara06e], Kernel ICA
[Yan05] and others (e.g. see [Zho03]). As an alternative to Kernel Eigenfaces, a multi-layer
perceptron (MLP) neural network with a bottleneck architecture [deM93, Mal98], shown in
Figure 2.6, was proposed to implement nonlinear PCA projection of faces [Mog02], but has
since been largely abandoned due to the difficulty of training it optimally [Che97]^. The recently
proposed Isomap [Ten00] and Locally Linear Embedding (LLE) [Row01] algorithms were
successfully applied to unfolding nonlinear appearance manifolds [Bai05, Kim05b, Pan06,
Yan02a], as were piece-wise linear models [Ara05b, Lee03, Pen94].

^The reader may be interested in the following recent paper that proposes an automatic way of initializing 
the network weights so that they are close to a good solution [Hin06]. 
Figure 2.5: (a) A conceptual picture of the Fisherfaces method. All data is projected and
then classified in the linear subspace that best separates classes, which are assumed to be
isotropic and Gaussian. (b) The first 10 most discriminative modes of the CamFace data,
shown as images. To the human eye these are not as meaningful as face space basis, see
Figure 2.4.



Figure 2.6: The feed-forward bottleneck neural network structure, used to implement nonlinear
projection of faces. Functionality-wise, two distinct parts of the network can be identified:
(i) layers 1-3 perform compression of data by exploiting the inherently low-dimensional nature
of facial appearance variations; (ii) layers j-5 perform classification of the projected
data.


Local feature-based methods. While the above-mentioned methods improve on the 
linear subspace approaches by more accurately modelling appearance variations seen in 
training, they fail to significantly improve on the limited ability of the original methods in
generalizing appearance to unseen imaging conditions (i.e. illumination, pose and so on)
[Bae02, Bar98a, Gro01, Nef96, Sha02b].

Local feature-based methods were proposed as an alternative to holistic appearance 
algorithms, as a way of achieving a higher degree of viewpoint invariance. Due to the 
smoothness of faces, a local surface patch is nearly planar and its appearance changes 
can be expected to be better approximated by a linear subspace than those of an entire 
face. Furthermore, the more limited spatial extent of local patches and the consequent lower
subspace dimensionality bring computational benefits and make such methods less likely to
suffer from the so-called curse of dimensionality.

Trivial extensions such as Eigenfeatures [Abd98, Pen94] and Eigeneyes [Cam00] demonstrated 
this, achieving recognition rates better than those of the original Eigenfaces [Cam00]. 
Even better results were obtained using hybrid methods, i.e. combinations of holistic and 
local patch-based appearances [Pen94]. 

An influential group of local feature-based methods are the Local Feature Analysis (LFA) 
(or elastic graph matching) algorithms, the most acclaimed of these being Elastic Bunch 
Graph Matching (EBGM) [Arc03, Bol03, Pen96, Wis97, Wis99b]^. LFA methods have 
proven to be amongst the most successful in the literature [Heo03b] and are employed in 
commercial systems such as FaceIt® by Identix [Ide03] (the best performing commercial 
face recognition software in the 2003 Face Recognition Vendor Test [Phi03]). 

The common underlying idea behind these methods is that they combine in a unified 
manner holistic with local features, and appearance information with geometric structure. 
Each face is represented by a graph overlaid on the appearance image. Its nodes correspond 
to the locations of features used to describe local face appearance, while its edges 
constrain the holistic geometry by representing feature adjacency, as shown in Figure 2.7. To 
establish the similarity between two faces, their elastic graph representations are compared 
by measuring the distortion between their topological configurations and the similarities of 
feature appearances. 

Various LFA methods primarily differ: (i) in how local feature appearances are represented 
and (ii) in the way two elastic graphs are compared. In Elastic Bunch Graph 
Matching [Bol03, Sen99, Wis97, Wis99b], for example, Gabor wavelet [Gab88] jets are used 
to characterize local appearance. In part, their use is attractive in this context because the 
functional form of Gabor wavelets closely resembles the response of receptive fields of the 
visual cortex [Dau80, Dau88, Jon87, Mar80], see Figure 2.8. They also provide powerful 
means of extracting local frequency information, which has been widely employed in various 

^The reader should note that LFA-based algorithms are sometimes categorized as model-based. In this 
thesis the term model-based is used for algorithms that explain the entire observed face appearance. 
Figure 2.7: An elastic graph successfully fitted to an input image. Its nodes correspond 
to fiducial points, which are used to characterize local, discriminative appearance. Graph 
topology (i.e. fiducial point adjacency constraints) is typically used only in the fitting stage, 
but is discarded for recognition as it is greatly distorted by viewpoint variations. 



Figure 2.8: A 2-dimensional Gabor wavelet, shown as a surface and a corresponding intensity 
image. The wavelet closely resembles the response of receptive fields of the visual cortex and 
provides a trade-off between spatial and frequency localization of the signal (i.e. appearance). 


branches of computer vision for characterizing texture [Hon04, Lee96, Mun05, Pun04]. Local 
responses to multi-scale morphological operators (dilation and erosion) were also successfully 
used as fiducial point descriptors [Kot98, Kot00a, Kot00b, Kot00c]. 

Unlike any of the previous methods, LFA algorithms generalize well in the presence of 
facial expression changes [Phi95, Wis99a]. On the other hand, much like the early 
geometric-feature based methods, they struggle with significant viewpoint changes both in the 
graph fitting stage and in recognition, as the projected topological layout of fiducial 
features changes dramatically with out-of-plane head rotation. Furthermore, both wavelet-based 
and morphological response-based descriptors show little invariance to illumination changes, 
causing a sharp performance decay in realistic imaging conditions (an Equal Error Rate of 
25-35% was reported in [Kot00c]) and, importantly for the work presented in this thesis, with 
low resolution images [Ste06]. 

Appearance-based methods: a summary. In closing, purely appearance-based recognition 
approaches can achieve good generalization to unseen (i) poses and (ii) facial expressions 
by using local or hybrid local and holistic features. However, they all generalize poorly 
in the presence of large illumination changes. 

2.3.2 Model-based methods 

The success of LFA in recognition across pose and expression can be attributed to the shift 
away from purely statistical pattern classification to the use of models that exploit a priori 
knowledge about the very specific instance of classification that face recognition is. Model- 
based methods take this approach further. They formulate models of image formation with 
the intention of recovering (i) mainly person-specific (e.g. face albedo or shape) and (ii) 
extrinsic, nuisance variables (e.g. illumination direction, or head yaw). The key challenge 
lies in coming up with models for which the parameter estimation problem is not ambiguous 
or ill-conditioned. 

2D illumination models. The simplest generative models in face recognition are used 
for illumination normalization of raw images, as a preprocessing step cascaded with, 
typically, appearance-based classification that follows it. Considering the previously-discussed 
observation that the face surface, as well as albedo, are largely smooth, and assuming a 
Lambertian reflectance model, illumination effects on appearance are for the most part 
slowly spatially varying, see Figure 2.9. On the other hand, discriminative person-specific 
information is mostly located around facial features such as the eye, the mouth and the 
nose, which contain discontinuities and give rise to appearance changes of high frequency, 
as illustrated in Figure 2.10 (a). 

This spatial frequency model has been applied in the form of high-pass and band-pass filters 
[Ara05c, Ara06a, Buh94, Fit02], Laplacian-of-Gaussian filters [Adi97, Ara06f], edge maps 
[Adi97, Ara06f, Gao02, Tak98] and intensity derivatives [Adi97, Ara06f], to name the few most 
popular approaches, see Figure 2.10 (b) (also see Chapter 6). Although widely used due to its 
simplicity, numerical efficiency and the lack of assumptions on head pose, the model is 
universally regarded as insufficiently sophisticated in all but mild illumination conditions, 
struggling with cast shadows and specularities, for example. 
Various modifications have thus been proposed. In the Self-Quotient Image (SQI) method 
[Wan04a], the mid-frequency, discriminative band is also scaled with local image intensity, 
thus normalizing edge strengths in weakly and strongly illuminated regions of the face. 
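
To make the effect of this intensity scaling concrete, the following minimal sketch contrasts a 
plain high-pass residual with a self-quotient on a 1-dimensional toy signal. It is an illustration 
under stated assumptions and not the SQI algorithm of [Wan04a]: a simple box filter stands in for 
the smoothing kernel, and the function names (box_smooth, high_pass, self_quotient) and the toy 
signal are hypothetical. 

```python
def box_smooth(signal, radius):
    """Local mean over a window of +/- radius samples (the window is clipped at the borders)."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def high_pass(signal, radius=2):
    """Signal minus its local mean: keeps edges, but edge strength scales with brightness."""
    return [s - m for s, m in zip(signal, box_smooth(signal, radius))]


def self_quotient(signal, radius=2, eps=1e-9):
    """Signal divided by its local mean: edge strength is normalized by local intensity."""
    return [s / (m + eps) for s, m in zip(signal, box_smooth(signal, radius))]


# Toy "albedo" profile with two edges, seen under weak and strong illumination (assumed values).
albedo = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 1.0, 1.0, 1.0]
dim = [0.2 * a for a in albedo]
bright = [2.0 * a for a in albedo]

# Ratio of the strongest high-pass response under strong vs. weak illumination.
hp_ratio = max(map(abs, high_pass(bright))) / max(map(abs, high_pass(dim)))
# Largest difference between the two self-quotient signals.
sq_diff = max(abs(b - d) for b, d in zip(self_quotient(bright), self_quotient(dim)))

print(f"high-pass edge strength ratio (bright/dim): {hp_ratio:.3f}")
print(f"largest self-quotient difference: {sq_diff:.2e}")
```

In this toy case the high-pass edge responses differ by the ratio of the two illumination strengths, 
whereas the two self-quotient signals agree up to the small regularization constant. 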

A principled treatment of illumination invariant recognition for Lambertian faces, the 
Illumination Cones method, was proposed in [Geo98]. In [Bel96] it was shown that the 
Figure 2.9: (a) A conceptual drawing of the Lambertian reflectance model. Light is reflected 
isotropically, the reflected intensity being proportional to the cosine of the angle between 
the incident light direction l and the surface normal n. (b) The appearance of a face rendered 
as a texture-free Lambertian surface. 


set of images of a convex, Lambertian object, illuminated by an arbitrary number of point 
light sources at infinity, forms a polyhedral cone in the image space with dimension equal 
to the number of distinct surface normals. Georghiades et al. successfully used this result 
by reilluminating images of frontal faces. The key limitations of their method are (i) the 
requirement of at least 3 images for each novel face, illuminated from linearly independent 
directions and in the same pose, and (ii) the failure to account for non-Lambertian effects. 
These two limitations are separately addressed in [Nis06] and [Wol03]. In [Nis06], a simple 
model of diffuse reflections of a generic face is used to iteratively classify face regions as 
‘Lambertian, no cast shadows’, ‘Lambertian, cast shadow’ and ‘specular’, applying SQI-based 
normalization to each separately. It is important to observe that the success of this method, 
which uses a model of only a single (generic) face, demonstrates that shape and reflectance 
similarities across individuals can also be exploited so as to improve recognition. The 
Quotient Image (QI) method [Wol03] makes use of this by assuming that all human faces 
have the same shape. Using a small (~10) bootstrap set of individuals, each in 3 different 
illuminations, it is shown how pure albedo and illumination effects can approximately be 
separated from a single image of a novel face. However, unlike the method of Nishiyama 
and Yamaguchi [Nis06], QI does not deal well with non-Lambertian effects or cast shadows. 
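
The Lambertian image formation model that the methods above rely on can be written down in a 
few lines. The following sketch is illustrative only: the function names and the toy three-pixel 
"surface" are assumptions. It shows that, for distant point sources, the images produced by 
individual sources add, which is one ingredient of the cone structure shown in [Bel96]. 

```python
import math


def lambertian_intensity(albedo, normal, light):
    """Pixel intensity albedo * max(0, n . l) for a unit normal n and a light vector l.

    The length of l encodes the source strength; max(0, .) models attached shadows.
    """
    dot = sum(n * l for n, l in zip(normal, light))
    return albedo * max(0.0, dot)


def render(albedos, normals, lights):
    """Image of a surface under several distant point sources: their contributions add."""
    return [sum(lambertian_intensity(a, n, l) for l in lights)
            for a, n in zip(albedos, normals)]


# Hypothetical 3-pixel surface: one patch facing the camera and two tilted by 45 degrees.
s = math.sqrt(0.5)
albedos = [0.5, 0.8, 0.8]
normals = [(0.0, 0.0, 1.0), (s, 0.0, s), (-s, 0.0, s)]

frontal = (0.0, 0.0, 1.0)   # unit-strength source along the viewing direction
side = (1.0, 0.0, 0.0)      # unit-strength source from the side

image_frontal = render(albedos, normals, [frontal])
image_side = render(albedos, normals, [side])
image_both = render(albedos, normals, [frontal, side])

# Superposition: the image under both sources is the sum of the single-source images.
print(all(abs(b - (f + g)) < 1e-12
          for b, f, g in zip(image_both, image_frontal, image_side)))
```

The third patch faces away from the side source, so it receives no light from it; such attached 
shadows are what the max(0, .) term accounts for. 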


2D combined shape and appearance models. The Active Appearance Model (AAM) 
[Coo98] was proposed for modelling objects that vary in shape and appearance. It has a 
lot of similarity to the older Active Contour Model [Kas87] and the Active Shape Model 
[Coo95, Ham98a, Ham98b] (also see [Scl98]) that model shape only. 

In AAM a deformable triangular (cf. EBGM) mesh is fitted to an image of a face, see 
Figure 2.11 (a). This is guided by combined statistical models of shape and shape-free 
appearance, so as to best explain the observed image. In [Coo98] linear, PCA models are used, 
see Figure 2.11 (b). Still, AAM parameter estimation is a difficult optimization problem. 
However, given that faces do not vary a lot in either shape or appearance, the structure of 
the problem is similar whenever the minimization is performed. In [Coo98] and [Coo99b], 
this is exploited by learning a linear relationship between the current reconstruction error 
and the model parameter perturbation required to correct it (for variations on the basic 
algorithm also see [Coo02, Sco03]). 

AAMs have been successfully used for face recognition [Edw98b, Fag06], tracking [Dor03] 
and expression recognition [Saa06]. The main limitations of the original method are: (i) 
the sensitivity to illumination changes in the recognition stage, and (ii) occlusion (including 
self-occlusion, due to 3D rotation for example). The latter problem was recently addressed 
by Gross et al. [Gro06], a modification to the original algorithm demonstrating good fitting 
results with extreme pose changes and occlusion. 


3D combined shape and illumination models. The most recent and complex generative 
models used for face recognition are largely based on the 3D Morphable Model introduced 
in [Bla99], which builds on the previous work on 3D modelling and animation 
of faces [DeG98, DiP91, MT89, Par75, Par82, Par96]. The model consists of albedo values 
at the nodes of a 3-dimensional triangular mesh describing face geometry. Model fitting is 
performed by combining a Gaussian prior on the shape and texture of human faces with 
photometric information from an image [Wal99b]. The priors are estimated from training 
data acquired with a 3D scanner and densely registered using a regularized 3D optical flow 
algorithm [Bla99]. 3D model recovery from an input image is performed using a gradient 
descent search in an analysis-by-synthesis loop. Linear [Rom02] or quadratic [Rom03b] error 
functions have been successfully used. 

An attractive feature of the 3D Morphable Model is that it explicitly models both intrinsic 
and extrinsic variables, respectively: face shape and albedo, and pose and illumination 
parameters, see Figure 2.12 (a). On the other hand, it suffers from convergence problems in 
the presence of background clutter or facial occlusions (glasses or facial hair). Furthermore 
and importantly for the work presented in this thesis, the 3D Morphable Model requires 
high quality image input [Eve04] and struggles with non-Lambertian effects or multiple light 
sources. Finally, nontrivial user intervention is required (localization of up to seven facial 
landmarks and the dominant light direction, see Figure 2.12 (b)), the fitting procedure is 
slow [Vet04] and can get stuck in local minima [Lee04]. 
2.3.3 Image set and video-based methods. 

Both appearance and model-based methods have been applied to face recognition using 
image set or video sequence matching. In principle, evidence from multiple shots can be used 
to optimize model parameter recovery in model-based methods and reduce the problem of 
local minima [Edw99]. However, an important limitation lies in the computational demands 
of model fitting. Specifically, it is usually too time consuming to optimize model parameters 
over an entire sequence. If, on the other hand, parameter constraints are merely propagated 
from the first frame, the fitting can experience steady deterioration over time, the so-called 
drift. 

More variability in the manner in which still-based algorithms are extended to deal with 
multi-image input is seen amongst appearance-based methods. We have stated that the main 
limitation of purely appearance-based recognition is that of limited generalization ability, 
especially in the presence of greatly varying illumination conditions. On the other hand, a 
promising fact is that data collected from video sequences often contains some variability in 
these parameters. 

Eigenfaces, for example, have been used on a per-image basis, with the recognition decision 
cast using majority vote [Ara06e]. A similar voting approach was also successfully used 
with local features in [Cam00], which were extracted by tracking a face using a Gabor 
Wavelet Network [Cam00, Krü00, Kru02]. In [Tor00] video information is used only in the 
training stage to construct person-specific PCA spaces, self-eigenfaces, while verification was 
performed from a single image using the Distance from Feature Space criterion. Classifiers 
using different eigenfeature spaces were used in [Pri01] and combined using the sum rule 
[Kit98]. Better use of training data is made with various discriminative methods such as 
Fisherfaces, which can be used to estimate database-specific optimal projection [Edw97]. 

An interesting extension of appearance correlation-based recognition to matching sets of 
faces was proposed by Yamaguchi et al. [Yam98]. The so-called Mutual Subspace Method 
(MSM) has since gained considerable attention in the literature. In MSM, linear subspaces 
describing appearance variations within sets or sequences are matched using canonical 
correlations [Git85, Hot36, Kai74, Oja83]. It can be shown that this corresponds to finding 
the most similar modes of variation between subspaces [Kim07] (see Chapters 6 and 7, and 
Appendix B for more detail and criticism of MSM). A discriminative heuristic extension was 
proposed in [Fuk03] and a more rigorous framework in [Kim06]. This group of methods 
typically performs well when some appearance variation between training and novel input is 
shared [Ara05b, Ara06e], but fails to generalize in the presence of large illumination changes, 
for example [Ara06b]. The same can be said of the methods that use the temporal component 
to enforce prior knowledge on likely appearance changes between consecutive frames. 
Table 2.4: A qualitative comparison of advantages and disadvantages of the two main groups 
of face recognition methods in the literature. 

Advantages 

  Appearance-based 
  • Well-understood statistical methods can be applied. 
  • Can be used for poor quality and low resolution input. 

  Model-based 
  • Explicit modelling and recovery of personal and extrinsic variables. 
  • Prior, domain-specific knowledge is used. 

Disadvantages 

  Appearance-based 
  • Lacking generalization to unseen pose, illumination etc. 
  • No (or little) use of domain-specific knowledge. 

  Model-based 
  • High quality input is required. 
  • Model parameter recovery is time-consuming. 
  • Fitting optimization can get stuck in a local minimum. 
  • User intervention is often required for initialization. 
  • Difficult to model complex illumination effects - fitting becomes an ill-conditioned 
    problem. 


In the algorithm of Zhou et al. [Zho03] the joint probability distribution of identity and 
motion is modelled using sequential importance sampling, yielding the recognition decision 
by marginalization. Lee et al. [Lee03] approximate face manifolds by a finite number of 
infinite extent subspaces and use temporal information to robustly estimate the operating 
part of the manifold. 


2.3.4 Summary. 

Amongst a great number of developed face recognition algorithms, we have seen that two 
drastically different groups of approaches can be identified: appearance-based and 
model-based, see Figure 2.13. The preceding section described the rich variety of methods within 
each group and highlighted their advantages and disadvantages. In closing, we summarize 
these in Table 2.4. 

2.4 Performance evaluation 

To motivate different performance measures used across the literature, it is useful to first 
consider the most common paradigms in which face matching is used. These are 

• recognition - 1-to-N matching, 

• verification - 1-to-1 matching, and 

• retrieval. 

In this context by the term “matching” we mean that the result of a comparison of two face 
representations yields a scalar, numerical score d that measures their dissimilarity. 

Paradigm 1: 1-to-N matching. In this setup novel input is matched to each of the 
individuals in a database of known persons and classified to - recognized as - the closest, 
most similar one. One and only one correct correspondence is assumed to exist. This is 
illustrated in Figure 2.14 (a). 

When 1-to-N matching is considered, the most natural and often used performance 
measure is the recognition rate. We define it as the ratio of the number of correctly assigned 
test persons to the total number of test persons. 
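
The recognition rate can be computed directly from this definition. The following minimal 
sketch assumes a generic dissimilarity function and uses scalar toy "representations"; the names 
recognize and recognition_rate, as well as the toy database, are illustrative assumptions. 

```python
def recognize(dissimilarity, novel, database):
    """1-to-N matching: return the identity whose stored representation is closest to `novel`."""
    return min(database, key=lambda identity: dissimilarity(novel, database[identity]))


def recognition_rate(dissimilarity, test_set, database):
    """Ratio of correctly assigned test persons to the total number of test persons.

    `test_set` is a list of (true_identity, representation) pairs.
    """
    correct = sum(1 for true_id, rep in test_set
                  if recognize(dissimilarity, rep, database) == true_id)
    return correct / len(test_set)


def abs_diff(a, b):
    """Toy dissimilarity between two scalar 'representations'."""
    return abs(a - b)


# Hypothetical database of known persons and a small test set.
database = {"anna": 0.0, "ben": 5.0, "cara": 10.0}
test_set = [("anna", 0.4), ("ben", 4.0), ("cara", 7.0)]

rate = recognition_rate(abs_diff, test_set, database)
print(f"recognition rate: {rate:.3f}")
```

Here the third test person is closer to "ben" than to the correct identity, so two of the three 
test persons are correctly assigned. 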

Paradigm 2: 1-to-1 matching. In 1-to-1 matching, only a single comparison is considered 
at a time and the question asked is whether two people are the same. This is equivalent to 
thresholding the dissimilarity measure d used for matching, see Figure 2.14 (b). 

Given a particular distance threshold $d^*$, the true positive rate (TPR) $p_t(d^*)$ is the 
proportion of intra-personal comparisons that yields distances within the threshold. Similarly, 
the false positive rate (FPR) $p_f(d^*)$ is the proportion of inter-personal comparisons that 
yields distances within the threshold. As $d^*$ is varied, the changes in $p_t(d^*)$ and $p_f(d^*)$ 
are often visualized using the so-called Receiver-Operator Characteristic (ROC) curve, see 
Figure 2.15. 

The Equal Error Rate (EER) point of the ROC curve is sometimes used for brevity: 

$\mathrm{EER} = p_f(d_{EER})$, where $p_f(d_{EER}) = 1 - p_t(d_{EER})$, (2.1) 

see Figure 2.15. 
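
The following sketch shows how the two rates and an approximate EER can be obtained from sets 
of intra- and inter-personal distances. With a finite number of comparisons the two error rates 
rarely coincide exactly, so the sketch adopts one possible convention (an assumption, not 
prescribed above): it picks the observed distance at which they are closest and reports their 
mean. The function names and the toy distances are hypothetical. 

```python
def rates(intra, inter, threshold):
    """TPR and FPR at a threshold: proportions of intra- and inter-personal distances within it."""
    tpr = sum(d <= threshold for d in intra) / len(intra)
    fpr = sum(d <= threshold for d in inter) / len(inter)
    return tpr, fpr


def equal_error_rate(intra, inter):
    """Approximate EER threshold and value by sweeping thresholds over the observed distances."""
    best = None
    for t in sorted(set(intra) | set(inter)):
        tpr, fpr = rates(intra, inter, t)
        miss = 1.0 - tpr                      # false negative rate, 1 - TPR
        gap = abs(fpr - miss)
        if best is None or gap < best[0]:
            best = (gap, t, (fpr + miss) / 2.0)
    return best[1], best[2]


# Toy distances: same-person comparisons are mostly smaller than different-person ones.
intra = [0.1, 0.2, 0.3, 0.6]
inter = [0.4, 0.7, 0.8, 0.9]

d_eer, eer = equal_error_rate(intra, inter)
print(f"threshold: {d_eer}, EER: {eer:.2f}")
```

Sweeping only the observed distances is sufficient because the two rates change only at those 
values. 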

Paradigm 3: retrieval. In the retrieval paradigm the novel person is now a query to 
the database, which may contain several instances of any individual. The result of a query 
is an ordering of the entire database using the dissimilarity measure d, see Figure 2.14 (c). 
More successful orderings have instances of the query individual first (i.e. with a lower recall 
index). 

From the above, it can be seen that the normalized sum of indexes corresponding to in-class 
faces is a meaningful measure of the recall accuracy. We call this the rank ordering score 
and compute it as follows: 

$\rho = 1 - \frac{S - m}{M}$, (2.2) 

where S is the sum of indexes of retrieved in-class faces, and m and M are, respectively, the 
minimal and maximal values that S and (S - m) can take. 

The score of ρ = 1.0 corresponds to orderings which correctly cluster all the data (all 
the in-class faces are recalled first), 0.0 to those that invert the classes (the in-class faces are 
recalled last), while 0.5 is the expected score of a random ordering. The average normalized 
rank [Sal83] is equivalent to 1 - ρ. 
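
As a sketch of how the rank ordering score may be computed, assume that recall indexes start 
at 1 and that the database holds N faces of which n are in-class. The smallest possible sum of 
in-class indexes is then m = n(n + 1)/2 and the largest possible value of S - m is M = n(N - n). 
These closed forms follow from the definitions above under the stated indexing assumption; the 
function name and the toy orderings are hypothetical. 

```python
def rank_ordering_score(ranked_labels, query_label):
    """Rank ordering score 1 - (S - m) / M for a database ordered by increasing dissimilarity."""
    n_total = len(ranked_labels)
    in_class = [i for i, label in enumerate(ranked_labels, start=1) if label == query_label]
    n = len(in_class)
    if n == 0 or n == n_total:
        raise ValueError("the score needs both in-class and out-of-class faces")
    s = sum(in_class)             # S: sum of recall indexes of the in-class faces
    m = n * (n + 1) // 2          # smallest S: all in-class faces recalled first
    big_m = n * (n_total - n)     # largest S - m: all in-class faces recalled last
    return 1.0 - (s - m) / big_m


perfect = rank_ordering_score(["a", "a", "b", "b", "b"], "a")
inverted = rank_ordering_score(["b", "b", "b", "a", "a"], "a")
mixed = rank_ordering_score(["a", "b", "a", "b"], "a")
print(perfect, inverted, mixed)
```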


2.4.1 Data 

Most algorithms in this thesis were evaluated on three large data sets of video sequences - 
the CamFace, ToshFace and Face Video Database. These are briefly described next. Other 
data, used only in a few specific chapters, is explained in the corresponding evaluation 
sections. The algorithm we used to automatically extract faces from video is described in 
Appendix C.2. 

The CamFace dataset. This database contains 100 individuals of varying age and ethnicity, 
and equally represented genders. For each person in the database there are 7 video 
sequences of the person in arbitrary motion (significant translation, yaw and pitch, negligible 
roll), each in a different illumination setting, see Fig. 2.16 (a) and 2.17. Each sequence was 
acquired for 10s at 10fps and 320 x 240 pixel resolution (face size ~ 60 pixels). For more 
information see Appendix C in which this database is thoroughly described. 

The ToshFace dataset. This database was kindly provided to us by Toshiba Corporation. 
It contains 60 individuals of varying age, mostly male Japanese, and 10 sequences per 
person. Each 10s sequence corresponds to a different illumination setting, acquired at 10fps 
and 320 x 240 pixel resolution (face size ~ 60 pixels), see Fig. 2.16 (b). 

The Face Video Database. This database is freely available and described in [Gor05]. 
Briefly, it contains 11 individuals and 2 sequences per person, with little variation in 
illumination but extreme and uncontrolled variations in pose, acquired for 10-20s at 25fps and 
160 x 120 pixel resolution (face size ~ 45 pixels), see Fig. 2.16 (c). 

2.5 Summary and conclusions 

This chapter finished laying out the foundations for understanding the novelty of this thesis. 
The challenges of face recognition were explored by presenting a detailed account of previous 
research attempts at solving the problem at hand. It was established that both major 
methodologies, discriminative model-based and generative model-based, suffer from serious 
limitations when dealing with data acquired in realistic, practical conditions. 

The second part of the chapter addressed the issue of evaluating and comparing face 
recognition algorithms. We described a number of useful performance measures and three 
large data sets that will be used extensively throughout this work. 
[Panel labels: Y-derivative; Laplacian of Gaussian] 

Figure 2.10: (a) The simplest generative model used for face recognition: images [...] 
to consist of the low-frequency band that mainly corresponds to illumination [...] 
frequency band which contains most of the discriminative, personal information [...] 
noise. (b) The results of several most popular image filters operating under [...] 
of the frequency model. 
Figure 2.11: (a) Two input images with correct adaptation (top) and the corresponding 
geometrically normalized images (bottom) [Dor03]. (b) First three modes of the AAM 
appearance model in ±3 standard deviations [Kan02]. 



(b) Initialization 


Figure 2.12: (a) Simultaneous reconstruction of 3D shape and texture of a new face from 
two images taken under different conditions. In the centre row, the 3D face is rendered on 
top of the input images [Bla99]. (b) 3D Morphable Model Initialization: seven landmarks 
for front and side views and eight for the profile view are manually labelled for each input 
image [LiO)]- 
Figure 2.13: A summary of reviewed face recognition trends in the literature. Also see 
Table 2.4 for a comparative summary. 



(c) Retrieval 


Figure 2.14: Three matching paradigms give rise to different performance measures for face 
recognition algorithms. 
Figure 2.15: The variations in the true positive and false positive rates, as functions of the 
distance threshold, are often visualized using the Receiver-Operator Characteristic (ROC) 
curve, by plotting them against each other. Shown is a family of ROC curves and the Equal 
Error Rate (EER) line. Better performing algorithms have ROC curves closer to the 100% 
true positive and 0% false positive rate. 
(a) CamFace 



(b) ToshFace 



(c) Face Video DB 


Figure 2.16: Frames from typical video sequences from the 3 databases used for evaluation 
of most recognition algorithms in this thesis. 
(a) FaceDB100 



(b) FaceDB60 


Figure 2.17: (a) Illuminations 1-7 from CamFace data set and (b) illuminations 1-10 from 
ToshFace data set. 
Manifold Density Divergence 


The preceding two chapters introduced the problem of face recognition from video, placed 
it into the context of biometrics-based identification methods and current practical demands 
on them, in broad strokes describing relevant research with its limitations. In this chapter 
we adopt the appearance-based recognition approach and set up the grounds for most of 
the material in the chapters that follow by formalizing the face manifold model. The first 
contribution of this thesis is also introduced - the Manifold Density Divergence (MDD) 
algorithm. 

Specifically, we address the problem of matching a novel face video sequence to a set 
of faces containing typical, or expected, appearance variations. We propose a flexible, 
semi-parametric model for learning probability densities confined to highly non-linear but 
intrinsically low-dimensional manifolds. The model leads to a statistical formulation of 
the recognition problem in terms of minimizing the divergence between densities estimated 
on these manifolds. The proposed method is evaluated on the CamFace data set and is 
shown to match the best and outperform other state-of-the-art algorithms in the literature, 
achieving 94% recognition rate on average. 


3.1 Introduction 

Training a system in certain imaging conditions (single illumination, pose and motion 
pattern) and being able to recognize under arbitrary changes in these conditions can be 
considered to be the hardest problem formulation for automatic face recognition. However, in 
many practical applications this is too strong a requirement. For example, it is often 
possible to ask a subject to perform random head motion under varying illumination 
conditions. It is often not reasonable, however, to request that the user perform a strictly 
defined motion, assume strictly defined poses or illuminate the face with lights in a specific 
setup. 
We therefore assume that the training data available to an AFR system is organized in a 
database where a set of images for each individual represents significant (typical) variability 
in illumination and pose, but does not exhibit temporal coherence and is not obtained in 
scripted conditions. 

The test data - that is, the input to an AFR system - also often consist of a set of 
images, rather than a single image. For instance, this is the case when the data is extracted 
from surveillance videos. In such cases the recognition problem can be formulated as taking 
a set of face images from an unknown individual and finding the best matching set in the 
database of labelled sets. This is the recognition paradigm we are concerned with in this 
chapter. 

We approach the task of recognition with image sets from a statistical perspective, as an 
instance of the more general task of measuring similarity between two probability density 
functions that generated two sets of observations. Specifically, we model these densities 
as Gaussian Mixture Models (GMMs) defined on low-dimensional nonlinear manifolds 
embedded in the image space, and evaluate the similarity between the estimated densities via 
the Kullback-Leibler divergence. The divergence, which for GMMs cannot be computed in 
closed form, is efficiently evaluated by a Monte Carlo algorithm. 

In the next section, we introduce our model and discuss the proposed method for learning 
and comparing face appearance manifolds. Extensive experimental evaluation of the 
proposed model and its comparison to state-of-the-art methods are reported in Section 3.3, 
followed by discussion of the results and a conclusion. 

3.2 Modelling face manifold densities 

Under the standard representation of an image as a raster-ordered pixel array, images of a 
given size can be viewed as points in a Euclidean image space. The dimensionality, D, of 
this space is equal to the number of pixels. Usually D is high enough to cause problems 
associated with the curse of dimensionality in learning and estimation algorithms. However, 
surfaces of faces are mostly smooth and have regular texture, making their appearance quite 
constrained. As a result, it can be expected that face images are confined to a face space, a 
manifold of lower dimension d ≪ D embedded in the image space [Bic94]. We next formalize 
this notion and propose an algorithm for comparing estimated densities on manifolds. 

3.2.1 Manifold density model 

The assumption of an underlying manifold subject to additive sensor noise leads to the 
following statistical model: An image x of subject i’s face is drawn from the probability 
density function (pdf) $p_F^{(i)}(x)$ within the face space, and embedded in the image space by 
means of a mapping function $f^{(i)}: \mathbb{R}^d \to \mathbb{R}^D$. The resulting point in the 
D-dimensional space is further perturbed by noise drawn from a noise distribution $p_n$ (note 
that the noise operates in the image space) to form the observed image X. Therefore the 
distribution of the observed face images of the subject i is given by: 

$p^{(i)}(X) = \int p_F^{(i)}(x) \, p_n\big(X - f^{(i)}(x)\big) \, dx$. (3.1) 

Note that both the manifold embedding function $f^{(i)}$ and the density $p_F^{(i)}$ on the 
manifold are subject-specific, as denoted by the superscripts, while the noise distribution $p_n$ 
is assumed to be common for all subjects. Following accepted practice, we model $p_n$ by an 
isotropic, zero-mean Gaussian. Figure 3.1 shows an example of a face image set projected onto 
a few principal components estimated from the data, and illustrates the validity of the manifold 
notion. 
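
As an illustration of this generative model, the following sketch draws observations for a toy 
case with d = 1 and D = 2. Everything specific in it is an assumption made for the example: the 
Gaussian density on the manifold, the circular embedding and the noise level are hypothetical 
and are not the face manifolds studied in this thesis. 

```python
import math
import random


def embed(x):
    """Hypothetical embedding f: R -> R^2, mapping the intrinsic coordinate onto a unit circle."""
    return (math.cos(x), math.sin(x))


def sample_observation(rng, manifold_std=0.5, noise_std=0.05):
    """Draw one observed point X = f(x) + n under the toy model."""
    x = rng.gauss(0.0, manifold_std)          # x drawn from the (assumed Gaussian) manifold density
    # Isotropic, zero-mean Gaussian noise is added in the embedding (image) space.
    return tuple(c + rng.gauss(0.0, noise_std) for c in embed(x))


rng = random.Random(0)
observations = [sample_observation(rng) for _ in range(500)]

# Observations stay close to the curve: their distance from the unit circle is
# of the order of the noise standard deviation.
mean_offset = sum(abs(math.hypot(*p) - 1.0) for p in observations) / len(observations)
print(f"mean distance from the manifold: {mean_offset:.3f}")
```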



(a) First three PCs 



(b) Second three PCs 


Figure 3.1: A typical manifold of face images in a training (small blue dots) and a test 
(large red dots) set. Data used come from the same person and are shown projected onto the 
first three (a) and second three (b) principal components. The nonlinearity and smoothness of 
the manifolds are apparent. Although globally quite dissimilar, the training and test manifolds 
have locally similar structures. 


Let the training database consist of sets $S_1, \ldots, S_K$, corresponding to K individuals. 
$S_i$ is assumed to be a set of independent and identically distributed (i.i.d.) observations 
drawn from (3.1). Similarly, the input set $S_0$ is assumed to be i.i.d. drawn from the test 
subject’s face image density $p^{(0)}$. The recognition task can then be formulated as selecting 
one among K hypotheses, the k-th hypothesis postulating that $p^{(0)} = p^{(k)}$. The 
Neyman-Pearson lemma [Dud00] states that the optimal solution for this task consists of choosing 
the model under which $S_0$ has the highest likelihood. Since the underlying densities are 
unknown, and the number of samples is limited, relying on direct likelihood estimation is 
problematic. Instead, we use Kullback-Leibler divergence as a “proxy” for the likelihood 
statistic needed in this K-ary hypothesis test [Sha02a]. 
Figure 3.2: Description lengths for varying numbers of GMM components for training (solid) 
and test (dashed) sets. The lines show the average plus/minus one standard deviation across 
sets. 


3.2.2 Kullback-Leibler divergence 

The Kullback-Leibler (KL) divergence [Cov91] quantifies how well a particular pdf q(x)
describes samples from another pdf p(x):

$$D_{KL}(p \| q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx \tag{3.2}$$

It is nonnegative and equal to zero iff p = q. Consider the integrand in (3.2). It can be
seen that the regions of the image space with a large contribution to the divergence are
those in which p(x) is significant and $p(x) \gg q(x)$. On the other hand, regions in which
p(x) is small contribute comparatively little. We expect the sets in the training data to be
significantly more extensive than the input set, and as a result $p^{(k)}$ to have broader support
than $p^{(0)}$. We therefore use $D_{KL}(p^{(0)} \| p^{(k)})$ as a “distance measure” between training and
test sets. This expectation is confirmed empirically, see Figure 3.2. The novel patterns not
represented in the training set are heavily penalized, but there is no requirement that all
variation seen during training should be present in the novel distribution.
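
The effect of the chosen direction can be illustrated on a toy case. The sketch below is not part of the method itself: it assumes two one-dimensional Gaussians (a "narrow" one standing in for the test density and a "wide" one for the training density, with arbitrarily chosen standard deviations) and uses the standard closed form for the KL divergence between them.

```python
import math

def kl_gauss_1d(mu_p, sd_p, mu_q, sd_q):
    """Closed-form D_KL(p || q) in nats for two 1D Gaussians."""
    return (math.log(sd_q / sd_p)
            + (sd_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sd_q ** 2)
            - 0.5)

# Illustrative values only: narrow "test" density, wide "training" density.
narrow_to_wide = kl_gauss_1d(0.0, 1.0, 0.0, 3.0)  # about 0.65: test explained by training
wide_to_narrow = kl_gauss_1d(0.0, 3.0, 0.0, 1.0)  # about 2.90: training not explained by test
```

The smaller value is obtained when the narrow density is the first argument, which mirrors the reason for placing the test density first in the divergence.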

We have formulated recognition in terms of minimizing the divergence between densities 
on face manifolds. Two problems still remain to be solved. First, since the analytical form 
for neither the densities nor the embedding functions is known, these must be estimated from 
the data. Second, the KL divergence between the estimated densities must be evaluated. 
In the remainder of this section, we describe our solution for these two problems. 

3.2.3 Gaussian mixture models 

Our goal is to estimate the density defined on a complex nonlinear manifold embedded
in a high-dimensional image space. Global parametric models typically fail to adequately
capture such manifolds. We therefore opt for a more flexible mixture model: the
Gaussian Mixture Model (GMM). This choice has a number of advantages:

• It is a flexible, semi-parametric model, yet simple enough to allow efficient estimation. 

• The model is generative and offers interpolation and extrapolation of face pattern 
variation based on local manifold structure. 

• Principled model order selection is possible. 

The multivariate Gaussian components of a GMM in our method need not be semantic
(corresponding to a specific view or illumination) and can be estimated using the
Expectation Maximization (EM) algorithm [Dud00]. The EM is initialized by K-means clustering,
and constrained to diagonal covariance matrices. As with any mixture model, it is important
to select an appropriate number of components in order to allow sufficient flexibility
while avoiding overfitting. This can be done in a principled way with the Minimum Description
Length (MDL) criterion [Bar98b]. Briefly, MDL assigns to a model a cost related to
the amount of information necessary to encode the model and the data given the model.
This cost, known as the description length, is determined by the negative log-likelihood of the
training data under that model, penalized by the model complexity, measured as the number of
free parameters in the model.
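
The model order selection described above can be sketched as follows. The exact form of the criterion in [Bar98b] is not reproduced in the text, so the sketch assumes the commonly used two-part form (negative log-likelihood plus half the number of free parameters times the log of the sample count); the function names are hypothetical.

```python
import math

def gmm_free_parameters(n_components, dim):
    """Free parameters of a diagonal-covariance GMM: means, variances, mixing weights."""
    return n_components * 2 * dim + (n_components - 1)

def description_length(log_likelihood, n_components, dim, n_samples):
    """Assumed two-part code length in nats: data cost plus model cost."""
    k = gmm_free_parameters(n_components, dim)
    return -log_likelihood + 0.5 * k * math.log(n_samples)

def select_order(log_likelihoods, dim, n_samples):
    """Pick the component count with the smallest description length.

    log_likelihoods maps a candidate component count to the log-likelihood
    of the training data under the GMM fitted with that many components.
    """
    return min(log_likelihoods,
               key=lambda c: description_length(log_likelihoods[c], c, dim, n_samples))
```

For instance, `select_order({1: -1000.0, 2: -900.0, 3: -899.0}, dim=2, n_samples=200)` prefers two components: the third component improves the fit too little to pay for its extra parameters.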

Average description lengths for different numbers of components for the data sets used in 
this chapter are shown in Figure 3.2. Typically, the optimal (in the MDL sense) number of 
components for a training manifold was found to be 18, while 5 was typical for the manifolds 
used for recognition. This is illustrated in Figures 3.3, 3.4 and 3.5. 

3.2.4 Estimating KL divergence 

Unlike in the case of Gaussian distributions, the KL divergence cannot be computed in a
closed form when p(x) and q(x) are GMMs. However, it is straightforward to sample from
a GMM. The KL divergence in (3.2) is the expectation of the log-ratio of the two densities
w.r.t. the density p. According to the law of large numbers [Gri92], this expectation can
be evaluated by a Monte Carlo simulation. Specifically, we can draw a sample $x_i$ from the
estimated density $\hat{p}$, compute the log-ratio of $\hat{p}$ and $\hat{q}$, and average this over M samples:

$$D_{KL}(\hat{p} \| \hat{q}) \approx \frac{1}{M} \sum_{i=1}^{M} \log \frac{\hat{p}(x_i)}{\hat{q}(x_i)} \tag{3.3}$$

Drawing from $\hat{p}$ involves selecting a GMM component and then drawing a sample from
the corresponding multivariate Gaussian. Figure 3.4 shows a few examples of samples drawn

(b)


Figure 3.3: Centres of the MDL GMM approximation to a typical training face manifold,
displayed as images (a) (also see Figure 3.5). These appear to correspond to different
pose/illumination combinations. Similarly, centres for a typical face manifold used for
recognition are shown in (b). As this manifold corresponds to a video in fixed illumination, the
number of Gaussian clusters is much smaller. In this case clusters correspond to different
poses only: frontal, looking down, up, left and right.



Figure 3.4: Synthetically generated images from a single Gaussian component in a GMM of 
a training image set. It can be seen that local manifold structure, corresponding to varying 
head pose in fixed illumination, is well captured.

Figure 3.5: A training face manifold (blue dots) and the centres of Gaussian clusters of the
corresponding MDL GMM model of the data (circles), projected on the first three principal 
components. 


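in this manner. In summary, we use the following approximation for the KL divergence
between the test set and the k-th subject’s training set:

$$D_{KL}\left(p^{(0)} \,\|\, p^{(k)}\right) \approx \frac{1}{M} \sum_{i=1}^{M} \log \frac{\hat{p}^{(0)}(x_i)}{\hat{p}^{(k)}(x_i)} \tag{3.4}$$

In our experiments we used M = 1000 samples.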
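
A minimal sketch of this Monte Carlo estimate is given below, assuming NumPy and diagonal-covariance GMMs represented as hypothetical `(weights, means, variances)` tuples of arrays with shapes `(c,)`, `(c, d)` and `(c, d)`. It illustrates the estimator only and is not the implementation used in the experiments.

```python
import numpy as np

def gmm_log_pdf(x, weights, means, variances):
    """Log-density of a diagonal-covariance GMM at the rows of x (shape n x d)."""
    diff = x[:, None, :] - means[None, :, :]                      # n x c x d
    log_comp = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    a = log_comp + np.log(weights)                                # n x c
    m = a.max(axis=1, keepdims=True)                              # log-sum-exp for stability
    return (m + np.log(np.exp(a - m).sum(axis=1, keepdims=True))).ravel()

def gmm_sample(n, weights, means, variances, rng):
    """Select a component per sample, then draw from its Gaussian."""
    comp = rng.choice(len(weights), size=n, p=weights)
    noise = rng.standard_normal((n, means.shape[1]))
    return means[comp] + np.sqrt(variances[comp]) * noise

def kl_monte_carlo(p, q, n_samples=1000, seed=0):
    """Average of log p(x_i) - log q(x_i) over samples x_i drawn from p."""
    rng = np.random.default_rng(seed)
    x = gmm_sample(n_samples, *p, rng)
    return float(np.mean(gmm_log_pdf(x, *p) - gmm_log_pdf(x, *q)))
```

Recognition would then amount to evaluating `kl_monte_carlo(p_test, p_train_k)` for each training model and choosing the k with the smallest value.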


3.3 Empirical evaluation 

We compared the performance of our recognition algorithm on the CamFace data set to
that of:

• KL divergence-based algorithm of Shakhnarovich et al. (Simple KLD) [Sha02a], 

• Mutual Subspace Method (MSM) [Yam98], 

• Constrained MSM (CMSM) [Fuk03] which projects the data onto a linear subspace 
before applying MSM, 

• Nearest Neighbour (NN) in the set distance sense; that is, achieving
$\min_{x \in S_0} \min_{y \in S_i} d(x, y)$.
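
The set-distance nearest neighbour baseline in the last item can be sketched as follows. The sketch assumes NumPy, image sets stored as arrays with one vectorized image per row, and the Euclidean distance for d; the function names are hypothetical.

```python
import numpy as np

def set_distance(S0, Si):
    """Smallest Euclidean distance between any row of S0 and any row of Si."""
    d2 = (np.sum(S0 ** 2, axis=1)[:, None]
          + np.sum(Si ** 2, axis=1)[None, :]
          - 2.0 * S0 @ Si.T)
    # Clip tiny negative values caused by floating-point cancellation.
    return float(np.sqrt(max(d2.min(), 0.0)))

def classify(S0, training_sets):
    """Index of the training set closest to the input set S0."""
    return int(np.argmin([set_distance(S0, Si) for Si in training_sets]))
```

Only the single closest pair of images determines the decision, so information in the remaining frames of either set is ignored.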

In Simple KLD, we used a principal subspace that captured 90% of the data variance. 
In MSM, the dimensionality of PCA subspaces was set to 9 [Fuk03], with the first three 
principal angles used for recognition. The constraint subspace dimensionality in CMSM (see 
[Fuk03]) was chosen to be 70. All algorithms were preceded by PCA performed on the
entire dataset, which resulted in dimensionality reduction to 150 (while retaining 95% of
the variance).

Table 3.1: Recognition accuracy (%) of the various methods using different training/testing
illumination combinations.

Method                    MDD    Simple KLD    MSM    CMSM    Set NN
Recognition rate, mean     94        69         83     92       89
Recognition rate, std       8         5         10      7        9


In each experiment we used all of the sets from one illumination setup as test inputs and 
the remaining sets as training data, see Appendix C. 


3.3.1 Results 

A summary of the experimental results is shown in Table 3.1. Notice the relatively good 
performance of the simple NN classifier. This supports our intuition that for training, 
even random illumination variation coupled with head motion is sufficient for gathering a 
representative set of samples from the illumination-pose face manifold. 

Both MSM-based methods scored relatively well, with CMSM achieving the best performance
of all of the algorithms besides the proposed method. That is an interesting result,
given that this algorithm has not received significant attention in the AFR community; to
the best of our knowledge, this is the first report of CMSM’s performance on a data set
of this size, with such illumination and pose variability. On the other hand, the lack of a
probabilistic model underlying CMSM may make it somewhat less appealing.

Finally, the performance of the two statistical methods evaluated, the Simple KLD
method and the proposed algorithm, is very interesting. The former performed worst,
while the latter produced the highest recognition rates out of the methods compared. This
suggests several conclusions. Firstly, that the approach to statistical modelling of manifolds
of faces is a promising research direction. Secondly, it is confirmed that our flexible GMM-
based model captures the modes of the data variation well, producing good generalization
results even when the test illumination is not present in the training data set. And lastly,
our argument in Section 3.2 for the choice of the direction of KL divergence is empirically
confirmed, as our method performs well even when the subject’s pose is only very loosely
controlled.


3.4 Summary and conclusions

In this chapter we introduced a new statistical approach to face recognition with image 
sets. Our main contribution is the formulation of a flexible mixture model that is able to
accurately capture the modes of face appearance under broad variation in imaging conditions.
The basis of our approach is the semi-parametric estimate of probability densities
confined to intrinsically low-dimensional, but highly nonlinear face manifolds embedded 
in the high-dimensional image space. The proposed recognition algorithm is based on a 
stochastic approximation of Kullback-Leibler divergence between the estimated densities. 
Empirical evaluation on a database with 100 subjects has shown that the proposed method, 
integrated into a practical automatic face recognition system, is successful in recognition 
across illumination and pose. Its performance was shown to match the best performing 
state-of-the-art method in the literature and exceed others. 


Related publications 

The following publications resulted from the work presented in this chapter: 

• O. Arandjelovic, G. Shakhnarovich, J. Fisher, R. Cipolla, and T. Darrell. Face recognition
with image sets using manifold density divergence. In Proc. IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), 1:pages 581-588, June 2005.
[Ara05b]


Unfolding Face Manifolds


In the previous chapter we addressed the problem of matching a novel face video sequence 
to a set of faces containing typical, or expected, appearance variations. In this chapter we 
move away from the assumption of having available such a large training corpus and instead 
match a novel sequence against a database which too contains only a single sequence per 
known individual. To solve the problem we propose the Robust Kernel RAD algorithm. 

Following the adopted appearance-based approach, we motivate the use of the Resistor- 
Average Distance (RAD) as a dissimilarity measure between densities corresponding to 
appearances of faces in a single sequence. We then introduce a kernel-based algorithm that 
makes use of the simplicity of the closed-form expression for RAD between two Gaussian 
densities, while allowing for modelling of complex but intrinsically low-dimensional face 
manifolds. Additionally, it is shown how geodesically local appearance manifold structure 
can be modelled, naturally leading to a stochastic algorithm for generalizing to unseen 
modes of data variation. On the CamFace data set our method is demonstrated to exceed 
the performance of state-of-the-art algorithms achieving the correct recognition rate of 98% 
in the presence of mild illumination variations. 


4.1 Dissimilarity between manifolds 

Consider the Kullback-Leibler divergence $D_{KL}(p \| q)$ employed in Chapter 3. As previously
discussed, the regions of the observation space that produce a large contribution to
$D_{KL}(p \| q)$ are those that are well explained by p(x), but not by q(x). The asymmetry of the
KL divergence makes it suitable in cases when it is known a priori that one of the densities
p(x) or q(x) describes a wider range of data variation than the other. This is conceptually
illustrated in Figure 4.1 (a).

However, in the proposed recognition framework, this is not the case -- pitch and yaw 
changes of a face are expected to be the dominant modes of variation in both training and 
novel data, see Figure 4.2. Additionally, exact head poses assumed by the user are expected 
to somewhat vary from sequence to sequence and the robustness to variations not seen in 
either is desired. This motivates the use of a symmetric “distance” measure. 

4.1.1 Resistor-Average distance. 

We propose to use the Resistor-Average distance (RAD) as a measure of dissimilarity between
two probability densities. It is defined as:

$$D_{RAD}(p, q) = \left[ D_{KL}(p \| q)^{-1} + D_{KL}(q \| p)^{-1} \right]^{-1} \tag{4.1}$$

(a)



(b) 

Figure 4.1: A 1D illustration of the asymmetry of the KL divergence (a). $D_{KL}(q \| p)$ is
an order of magnitude greater than $D_{KL}(p \| q)$ - the “wider” distribution q(x) explains the
“narrower” p(x) better than the other way round. In (b), $D_{RAD}(p, q)$ is plotted as a function
of $D_{KL}(p \| q)$ and $D_{KL}(q \| p)$.


Much like the KL divergence from which it is derived, it is nonnegative and equal to zero
iff p(x) = q(x), but unlike it, it is symmetric. Another important property of the Resistor-
Average distance is that when two classes of patterns $C_p$ and $C_q$ are distributed according to,
respectively, p(x) and q(x), $D_{RAD}(p, q)$ reflects the error rate of the Bayes-optimal classifier
between $C_p$ and $C_q$ [Joh01].

(a) (b)

Figure 4.2: A subset of 10 samples from two typical face sets used to illustrate concepts
addressed in this chapter (top) and the corresponding patterns in the 3D principal component
subspaces (bottom), estimated from data. The sets capture appearance changes of faces of
two different individuals as they performed unconstrained head motion in front of a fixed
camera. The corresponding pattern variations (blue circles) are highly nonlinear, with a
number of outliers present (red stars).


To see in what manner RAD differs from the KL divergence, it is instructive to consider
two special cases: when divergences in both directions between two pdfs are approximately
equal and when one of them is much greater than the other:

• $D_{KL}(p \| q) \approx D_{KL}(q \| p) \equiv D$

$$D_{RAD}(p, q) \approx D/2 \tag{4.2}$$

• $D_{KL}(p \| q) \gg D_{KL}(q \| p)$ or $D_{KL}(p \| q) \ll D_{KL}(q \| p)$

$$D_{RAD}(p, q) \approx \min\left( D_{KL}(p \| q), D_{KL}(q \| p) \right) \tag{4.3}$$

It can be seen that RAD very much behaves like a smooth min of $D_{KL}(p \| q)$ and $D_{KL}(q \| p)$
(up to a multiplicative constant), also illustrated in Figure 4.1 (b).
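
The two special cases are easy to check numerically. The following sketch simply evaluates definition (4.1) for given values of the two directed divergences; the zero-divergence guard and the function name are our own additions for illustration.

```python
def resistor_average(d_pq, d_qp):
    """D_RAD computed from the two directed KL divergences, as in (4.1)."""
    if d_pq == 0.0 or d_qp == 0.0:
        # Either divergence vanishes only when p = q, in which case RAD is zero.
        return 0.0
    return 1.0 / (1.0 / d_pq + 1.0 / d_qp)

equal_case = resistor_average(2.0, 2.0)         # D / 2, cf. (4.2)
unequal_case = resistor_average(1.0, 1000.0)    # close to min(1, 1000), cf. (4.3)
```

The name reflects the analogy with two resistors connected in parallel: the combined value is always below the smaller of the two and is dominated by it when the two differ greatly.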


4.2 Estimating RAD for nonlinear densities


Following the choice of the Resistor-Average distance as a means of quantifying the similarity
of manifolds, we turn to the question of estimating this distance for two arbitrary, nonlinear
face manifolds. For a general case there is no closed-form expression for RAD. However,
when p(x) and q(x) are two normal distributions [Yos99]:

$$D_{KL}(p \| q) = \frac{1}{2} \log \frac{|\Sigma_q|}{|\Sigma_p|} + \frac{1}{2} \mathrm{Tr}\left[ \Sigma_p \Sigma_q^{-1} + \Sigma_q^{-1} (\bar{x}_q - \bar{x}_p)(\bar{x}_q - \bar{x}_p)^T \right] - \frac{D}{2} \tag{4.4}$$

where D is the dimensionality of data, $\bar{x}_p$ and $\bar{x}_q$ the data means, and $\Sigma_p$ and $\Sigma_q$ the
corresponding covariance matrices.
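
As an illustration, the closed-form expression (4.4) can be evaluated as in the sketch below. It assumes NumPy, natural logarithms, and well-conditioned full-rank covariance matrices; the function name is hypothetical.

```python
import numpy as np

def kl_gaussians(mean_p, cov_p, mean_q, cov_q):
    """Closed-form D_KL(p || q) in nats between two multivariate Gaussians."""
    d = mean_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mean_q - mean_p
    # slogdet returns (sign, log|det|); it avoids overflow of the determinant itself.
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (logdet_q - logdet_p
                  + np.trace(cov_q_inv @ cov_p)
                  + diff @ cov_q_inv @ diff
                  - d)
```

Evaluating the function in both directions and combining the two values as in (4.1) gives the closed-form RAD between the two Gaussians.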

To achieve both expressive modelling of nonlinear manifolds as well as an efficient procedure
for comparing them, in the proposed method a nonlinear projection of data using
Kernel Principal Component Analysis (Kernel PCA) is performed first. We show that with
an appropriate choice of the kernel type and bandwidth, the assumption of normally distributed
face patterns in the projection space produces good KL divergence estimates. With
a reference to our generative model in (3.1), an appearance manifold is effectively unfolded
from the embedding image space.


4.3 Kernel PCA 

PCA is a technique in which an orthogonal basis transformation is applied such that the
data covariance matrix $C = \langle (x_i - \langle x_j \rangle)(x_i - \langle x_j \rangle)^T \rangle$ is diagonalized. When data $\{x_i\}$ lies
on a linear manifold, the corresponding linear subspace is spanned by the dominant (in the
eigenvalue sense) eigenvectors of C. However, in the case of nonlinearly distributed data,
PCA does not capture the true modes of variation well.

The idea behind KPCA is to map data into a high-dimensional space in which it is
approximately linear - then the true modes of data variation can be found using standard PCA.
Performing this mapping explicitly is prohibitive for computational reasons and inherently
problematic due to the “curse of dimensionality”. This is why a technique widely known as
the “kernel trick” is used to implicitly realize the mapping. Let function $\Phi$ map the original
data from input space to a high-dimensional feature space in which it is (approximately)
linear, $\Phi: \mathbb{R}^D \to \mathbb{R}^\Delta$, $\Delta \gg D$. In KPCA the choice of mappings $\Phi$ is restricted to the set
such that there is a function k (the kernel) for which:

$$\Phi(x_i)^T \Phi(x_j) = k(x_i, x_j) \tag{4.5}$$

In this case, the principal components of the data in $\mathbb{R}^\Delta$ space can be found by performing
computations in the input space only.

Assuming zero-centred data in the feature space (for information on centring data in the
feature space as well as a more detailed treatment of KPCA see [Sch99]), the problem of
finding principal components in this space is equivalent to solving the eigenvalue problem:

$$K u_i = \lambda_i u_i \tag{4.6}$$

where K is the kernel matrix:

$$K_{j,k} = k(x_j, x_k) = \Phi(x_j)^T \Phi(x_k) \tag{4.7}$$

The projection of a data point x to the i-th kernel principal component is computed
using the following expression [Sch99]:

$$a_i = \sum_{m=1}^{N} u_i^{(m)} k(x_m, x) \tag{4.8}$$

where $u_i^{(m)}$ is the m-th element of $u_i$.
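
The steps in (4.6)-(4.8) can be sketched in a few lines, assuming NumPy and an RBF kernel with a hypothetical bandwidth parameter `gamma`. Unlike the derivation above, which assumes zero-centred feature-space data, the sketch performs the centring explicitly in the standard way; the function names and the dictionary used to hold the model are our own choices.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """k(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    d2 = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kpca_fit(X, n_components, gamma):
    """Solve the eigenvalue problem (4.6) for the centred kernel matrix."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one     # centring in feature space
    lam, U = np.linalg.eigh(Kc)                     # eigenvalues in ascending order
    lam = lam[::-1][:n_components]
    U = U[:, ::-1][:, :n_components]
    U = U / np.sqrt(lam)                            # unit-norm axes in feature space
    return {"X": X, "K": K, "U": U, "gamma": gamma}

def kpca_project(model, Y):
    """Project the rows of Y onto the kernel principal components, cf. (4.8)."""
    X, K, U = model["X"], model["K"], model["U"]
    Ky = rbf_kernel(Y, X, model["gamma"])           # m x n kernel evaluations
    Kyc = (Ky - Ky.mean(axis=1, keepdims=True)
           - K.mean(axis=0, keepdims=True) + K.mean())
    return Kyc @ U
```

The number of components must not exceed the number of strictly positive eigenvalues of the centred kernel matrix, which is at most N - 1 for N data points.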

4.4 Combining RAD and kernel PCA 

The variation of face patterns is highly nonlinear (see Figure 4.3 (a)), making the task of 
estimating RAD between two sparsely sampled face manifolds in the image space difficult. 
The approach taken in this work is that of mapping the data from the input, image space 
into a space in which it lies on a nearly linear manifold. As before, we would not like to 
compute this mapping explicitly. Also, note that the inversions of data covariance matrices 
and the computation of their determinants in the expression for the KL divergence between 
two normal distributions (4.4) limit the maximal practical dimensionality of the feature 
space. 

In our method both of these problems are solved using Kernel PCA. The key observation
is that regardless of how high the feature space dimensionality is, the data has covariance in
at most N directions, where N is the number of data points. Therefore, given two data sets
of faces, each describing a smooth manifold, we first find the kernel principal components
of their union. After dimensionality reduction is performed by projecting the data onto the
first M kernel principal components, the RAD between the two densities, each now assumed
Gaussian, is computed. Note that the implicit nonlinear map is different for each data set
pair. The importance of this can be seen by noticing that the intrinsic dimensionality of the
manifold that both sets lie on is lower than that of the manifold that all data in a database lies
on, resulting in its more accurate “unfolding”, see Figure 4.3 (b).

We estimate covariance matrices in the Kernel PCA space using probabilistic PCA
(PPCA) [Tip99b]. In short, probabilistic PCA is an extension of the traditional PCA that
recovers parameters of a linear generative model of data (i.e. the full corresponding covariance
matrix), with the assumption of isotropic Gaussian noise: $C = V \Lambda V^T + \sigma I$. Note the
model of noise density in (3.1) that this assumption implies: $g^{(i)}(p_n(x)) \sim \mathcal{N}(0, \sigma I)$, where
$g^{(i)}(f^{(i)}(x)) = x$.
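
A sketch of such a covariance estimate is shown below, assuming NumPy and the maximum-likelihood PPCA solution in which the noise variance is the mean of the discarded eigenvalues. How the principal and noise terms are parameterized in our implementation is not detailed here, so the split used in the code is an assumption, as is the function name.

```python
import numpy as np

def ppca_covariance(Z, n_principal):
    """Full covariance model from the rows of Z: principal part plus isotropic noise."""
    C = np.cov(Z, rowvar=False)
    lam, V = np.linalg.eigh(C)                 # ascending eigenvalues
    lam, V = lam[::-1], V[:, ::-1]             # reorder to descending
    # Assumed ML estimate: noise variance is the mean discarded eigenvalue
    # (requires n_principal < data dimensionality).
    sigma2 = lam[n_principal:].mean()
    Vq, lam_q = V[:, :n_principal], lam[:n_principal]
    return Vq @ np.diag(lam_q - sigma2) @ Vq.T + sigma2 * np.eye(C.shape[0])
```

The resulting matrix is full rank even when few samples are available, which keeps the inverse and determinant in (4.4) well defined.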



(b) 


Figure 4.3: A typical face motion manifold in the input, image space exhibits high nonlinearity
(a). The “unfolded” manifold is shown in (b). It can be seen that Kernel PCA captures
the modes of data variation well, producing a Gaussian-looking distribution of patterns,
confined to a roughly 2-dimensional space (corresponding to the intrinsic dimensionality of the
manifold). In both (a) and (b) shown are projections to the first three principal components.


4.5 Synthetically repopulating manifolds 

In most applications, due to the practical limitations in the data acquisition process, AFR
algorithms have to work with sparsely populated face manifolds. Furthermore, some modes
of data variation may not be present in full. Specifically, in the AFR for authentication
setup considered in this work, the practical limits on how long the user can be expected to
wait for verification, as well as how controlled his motion can be required to be, limit the
possible variations that are seen in both training and novel video sequences. Finally, noise
in the face localization process increases the dimensionality of the manifolds faces lie on,
effectively resulting in even less densely populated manifolds. For a quantitative insight, it
is useful to mention that the face appearance variations present in a typical video sequence
used in evaluation in this chapter typically lie on a manifold of intrinsic dimensionality of
3-7, with 85 samples on average.

In this work, appearance manifolds are synthetically repopulated in a manner that 
achieves both higher manifold sample density, as well as some generalization to unseen 
modes of variation (see work by Martinez [Mar02], and Sung and Poggio [Sun98] for related 
approaches). To this end, we use domain-specific knowledge to learn face transformations in 
a more sophisticated way than could be realized by simple interpolation and extrapolation. 

Given an image of a face, x, we stochastically repopulate its geodesic neighbourhood by
a set of novel images $\{x_j^S\}$. Under the assumption that the embedding function $f$ in (3.1)
is smooth, geodesically close images correspond to small changes in the imaging parameters
(e.g. yaw or pitch). Therefore, using the first-order Taylor approximation of the effects of a
projective camera, the face motion manifold is locally similar to the affine warp manifold of
x. The proposed algorithm then consists of random draws of a face image x from the data,
stochastic perturbation of x by a set of affine warps $\{A_j\}$ and finally, the augmentation of
data by the warped images - see Figure 4.4. Writing the affine warp matrix decomposed to
rotation and translation, skew and scaling:

$$A = \begin{pmatrix} \cos\theta & -\sin\theta & t_x \\ \sin\theta & \cos\theta & t_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & k & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 + s_x & 0 & 0 \\ 0 & 1 + s_y & 0 \\ 0 & 0 & 1 \end{pmatrix} \tag{4.9}$$

in the proposed method, affine transformation parameters $\theta$, $t_x$ and $t_y$, $k$, and $s_x$ and $s_y$
are drawn from zero-mean Gaussian densities.
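
Drawing one such perturbation can be sketched as follows, assuming NumPy. The standard deviations are hypothetical placeholders rather than the values used in our experiments, and applying the warp to an image (which requires resampling) is left out.

```python
import numpy as np

def random_affine(rng, sd_theta=0.05, sd_t=1.0, sd_k=0.02, sd_s=0.02):
    """Draw one affine warp as rotation/translation times skew times scaling."""
    theta = rng.normal(0.0, sd_theta)          # in-plane rotation (radians)
    tx, ty = rng.normal(0.0, sd_t, size=2)     # translation (pixels)
    k = rng.normal(0.0, sd_k)                  # skew
    sx, sy = rng.normal(0.0, sd_s, size=2)     # relative scaling
    rot_trans = np.array([[np.cos(theta), -np.sin(theta), tx],
                          [np.sin(theta),  np.cos(theta), ty],
                          [0.0, 0.0, 1.0]])
    skew = np.array([[1.0, k, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
    scale = np.diag([1.0 + sx, 1.0 + sy, 1.0])
    return rot_trans @ skew @ scale
```

Because all parameters are zero-mean, the sampled warps are scattered around the identity, so the synthetic images stay in the neighbourhood of the original image.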


4.5.1 Outlier rejection 

In most cases, automatic face detection in cluttered scenes will result in a considerable
number of incorrect localizations - outliers. Typical outliers produced by the Viola-Jones
face detector employed in this chapter are reproduced from Appendix C in Figure 4.5.

Note that due to the complexity of face manifolds, outliers cannot be easily removed
in the input space. On the other hand, outlier rejection after Kernel PCA-based manifold
“unfolding” is trivial. However, a way of computing the kernel matrix robust to the presence
of outliers is needed. To this end, our algorithm uses RANSAC [Fis81] with an underlying
Kernel PCA model. The application of RANSAC in the proposed framework is summarized

Figure 4.4: The original, input data (dots) and the result of stochastically repopulating
the corresponding manifold (circles). A few samples from the dense result are shown as
images, demonstrating that the proposed method successfully captures and extrapolates the
most significant modes of data variation.



Figure 4.5: Typical false face detections identified by our algorithm. 


in Figure 4.6. Finally, the Robust Kernel RAD algorithm proposed in this chapter is in its 
entirety shown in Figure 4.7. 


4.6 Empirical evaluation 

We compared the recognition performance of the following methods¹ on the CamFace data
set:

• KL divergence-based algorithm of Shakhnarovich et al. (Simple KLD) [Sha02a],

• Simple RAD (based on Simple KLD),

• Kernelized Simple KLD algorithm (Kernel KLD),

• Kernel RAD,

• Robust Kernel RAD,

• Mutual Subspace Method (MSM) [Yam98],

• Majority vote using Eigenfaces, and

• Nearest Neighbour (NN) in the set distance sense; that is, achieving
$\min_{x \in S_0} \min_{y \in S_i} \|x - y\|_2$.

¹Methods were reimplemented through consultation with authors.


Input:  set of observations {x_i},
        KPCA space dimensionality D.
Output: kernel principal components {u_i}.

1: Initialize best minimal sample
       B = ∅
2: RANSAC iteration
       for it = 0 to LIMIT
3: Random sample draw
       {y_i} ← random draw from {x_i}
4: Kernel PCA
       {u_i} = KPCA({y_i})
5: Nonlinear projection
       {x_i^P} ← {x_i} projected onto {u_i}
6: Consistent data
       B_it = filter(D_MAH(x_i^P, 0) < T)
7: Update best minimal sample
       |B_it| > |B| ? B = B_it
8: Kernel PCA using best minimal sample
       {u_i} = KPCA(B)

Figure 4.6: RANSAC Kernel PCA algorithm for unfolding face appearance manifolds in the
presence of outliers.

Input:  sets of observations {a_i}, {b_i},
        KPCA space dimensionality D.
Output: inter-manifold distance D_RAD({a_i}, {b_i}).

1: Inliers with RANSAC
       V = {a_i^V}, {b_i^V} = RANSAC({a_i}, {b_i})
2: Synthetic data
       S = {a_i^S}, {b_i^S} = perturb({a_i^V}, {b_i^V})
3: RANSAC Kernel PCA
       {u_i} = KPCA(V ∪ S)
4: Nonlinear projection
       {a_i^P}, {b_i^P} ← (V, S) projected onto {u_i}
5: Closed-form RAD
       D_RAD({a_i^P}, {b_i^P})

Figure 4.7: Robust Kernel RAD algorithm summary.


In all KLD and RAD-based methods, 85% of data energy was explained by the principal
subspaces. In non-kernelized algorithms this typically resulted in the principal subspace
dimensionality of 16, see Figure 4.8. In MSM, the first 3 principal angles were used for
recognition, while the dimensionality of PCA subspaces describing the data was set to 9 [Yam98].
In the Eigenfaces method, the 150-dimensional principal subspace used explained ≈ 95%
of data energy. A 20-dimensional nonlinear projection space was used in all kernel-based
methods with the RBF kernel $k(x_i, x_j) = \exp\left(-\gamma (x_i - x_j)^T (x_i - x_j)\right)$. The optimal value of
the parameter $\gamma$ was learnt by optimizing the recognition performance on a 20 person training
data set. Note that people from this set were not included in the evaluation reported
in Section 4.6.1. We used $\gamma = 0.380$ for greyscale images normalized to have pixel values in
the range [0.0, 1.0].

In each experiment we used sets in a single illumination setup, with test and training 
sets corresponding to sequences acquired in two different sessions, see Appendix C. 

4.6.1 Results 

The performance of the evaluated recognition algorithms is summarized in Table 4.1. The 
results suggest a number of conclusions. 

Firstly, note the relatively poor performance of the two nearest neighbour-type methods
- the Set NN and the Majority vote using Eigenfaces. These can be considered as a proxy for
gauging the difficulty of the recognition task, seeing that both can be expected to perform
relatively well if the imaging conditions are not greatly different between training and test
data sets. An inspection of the incorrect recognitions of these methods offered an interesting
insight in one of their particular weaknesses, see Figure 4.9 (a). This reaffirms the conclusion
of [Sim04], showing that it is not only changes in the data acquisition conditions that are
challenging but also that there are certain intrinsically difficult imaging configurations.

Figure 4.8: Histograms of the dimensionality of the principal subspace in kernelized (dotted
line) and non-kernelized (solid line) KL divergence-based methods, across the evaluation
data set. The corresponding average dimensionalities were found to be ≈ 4 and ≈ 16. The
large difference illustrates the extent of nonlinearity of face motion manifolds.

The Simple KLD method consistently achieved the poorest results. We believe that the 
likely reason for this is the high nonlinearity of face manifolds corresponding to the training 
sets used, caused by near, office lighting used to vary the illumination conditions. This 
is supported by the dramatic and consistent increase in the recognition performance with 
kernelization. This result confirms the first premise of this work, showing that sophisticated 
face manifold modelling is indeed needed to accurately describe variations that are expected 
in realistic imaging conditions. Furthermore, the improvement observed with the use of 
Resistor-Average distance suggests its greater robustness with respect to unseen variations 
in face appearance, compared to the KL divergence. The performance of Kernel RAD 
was comparable to that of MSM, which ranked second-best in our experiments. The best 
performing algorithm was found to be Robust Kernel RAD. Synthetic manifold repopulation 
produced a significant improvement in the recognition rate (of about 10%), the proposed 
method correctly recognizing 98% of individuals. ROC curves corresponding to the methods
that best illustrate the contributions of this chapter are shown in Figure 4.9 (b), with Robust
Kernel RAD achieving an Equal Error Rate of 2%.

Table 4.1: Results of the comparison of our novel algorithm with existing methods in the
literature. Shown is the identification rate in %.


4.7 Summary and conclusions

In this chapter we introduced a novel method for face recognition from face appearance 
manifolds due to head motion. In the proposed algorithm the Resistor-Average distance 
computed on nonlinearly mapped data using Kernel PCA is used as a dissimilarity measure 
between distributions of face appearance, derived from video. A data-driven approach to 
generalization to unseen modes of variation was described, resulting in stochastic manifold 
repopulation. Finally, the proposed concepts were empirically evaluated on a database with 
100 individuals and mild illumination variation. Our method consistently achieved a high
recognition rate, on average correctly recognizing the subject in 98% of the cases and
outperforming state-of-the-art algorithms in the literature.


Related publications 

The following publications resulted from the work presented in this chapter: 

• O. Arandjelovic and R. Cipolla. Face recognition from face motion manifolds using
robust kernel resistor-average distance. In Proc. IEEE Workshop on Face Processing
in Video, 5:page 88, June 2004. [Ara04b]

• O. Arandjelovic and R. Cipolla. An information-theoretic approach to face recognition 
from face motion manifolds. Image and Vision Computing (special issue on Face 
Processing in Video Sequences), 24(6):pages 639-647, June 2006. [Ara06e]


Different individuals, difficult recognition conditions
Different individuals, favourable recognition conditions

(a) (b)

Figure 4.9: (a) The most common failure mode of NN-type recognition algorithms is caused
by “hard” illumination conditions and head poses. The two top images show faces that
due to severe illumination conditions and semi-profile head orientation look very similar in
spite of different identities (see [Sim04]) - the Set NN algorithm incorrectly classified these
frames as belonging to the same person. Information from other frames (e.g. the two bottom
images) is not used to achieve increased robustness. (b) Receiver Operator Characteristic
(ROC) curves of the Simple KLD, MSM, Kernel KLD and the proposed Robust Kernel RAD
methods. The latter exhibits superior performance, achieving an Equal Error Rate of 2%.


In the preceding chapters we dealt with increasingly difficult formulations of the face
recognition problem. As restrictions on both training and novel data were relaxed, more 
generalization was required. So far we addressed robustness to pose changes of the user, 
noise contamination and low spatiotemporal resolution of video. In this chapter we start 
exploring the important but difficult problem of recognition in the presence of changing 
illumination conditions in which faces are imaged. 

In practice, the effects of changing pose are usually least problematic and can often 
be overcome by acquiring data over a time period e.g. by tracking a face in a surveillance 
video. As before, we assume that the training image set for each individual contains some 
variability in pose, but is not obtained in scripted conditions or in controlled illumination. 

In contrast, illumination is much more difficult to deal with: in most cases it is not
practical to control the illumination setup, and its physics is difficult to model accurately.
Biometric imagery acquired in the thermal or near-infrared part of the electromagnetic
spectrum is useful in this regard as it is virtually insensitive to illumination changes. On
the other hand, it lacks much
of the individual, discriminating facial detail contained in visual images. In this sense, the
two modalities can be seen as complementing each other. The key idea behind the system 
presented in this chapter is that robustness to extreme illumination changes can be achieved 
by fusing the two. This paradigm will further prove useful when we consider the difficulty 
of recognition in the presence of occlusion caused by prescription glasses. 


5.1 Face recognition in the thermal spectrum 

A number of recent studies suggest that face recognition in the thermal spectrum offers a few
distinct advantages over the visible spectrum, including invariance to ambient illumination
changes [Wol01, Soc03, Pro00, Soc04]. This is because a thermal infrared sensor
measures the heat energy radiated by the face rather than the light reflected from it. In
outdoor environments, and particularly in direct sunlight, illumination invariance only holds
true to a good approximation for the Long-Wave Infrared (LWIR: 8-14 μm) spectrum, which
is fortunately measured by the less expensive uncooled thermal infrared camera technology.
Human skin has high emissivity in the Mid-Wave Infrared (MWIR: 3-5 μm) spectrum and
even higher emissivity in the LWIR spectrum, making face imagery by and large invariant
to illumination variations in these spectra.

Appearance-based face recognition algorithms applied to thermal infrared imaging
consistently performed better than when applied to visible imagery, under various lighting
conditions and facial expressions [Kon05, Soc02, Soc03, Sel02]. Further performance
improvements were achieved using decision-based fusion [Soc03]. In contrast to other techniques,
Srivastava and Liu [Sri03] performed face recognition in the space of Bessel function
parameters. First, they decompose each infrared face image using Gabor filters. Then, they
represent the face by modelling the marginal density of the Gabor filter coefficients using
Bessel functions. This approach has further been improved by Buddharaju et al. [Bud04].
Recently, Friedrich and Yeshurun [Fri03] showed that IR-based recognition is less sensitive
to changes in 3D head pose and facial expression.

A thermal sensor generates imaging features that uncover thermal characteristics of 
the face pattern. Another advantage of thermal infrared imaging in face recognition is 
the existence of a direct relationship to underlying physical anatomy such as vasculature. 
Indeed, thermal face recognition algorithms attempt to take advantage of such anatomical 
information of the human face as unique signatures. The use of vessel structure for human 
identification has been studied during recent years using traits such as hand vessel patterns 
[Lin04, Im03], finger vessel patterns [Shi04, Miu04] and vascular networks from thermal 
facial images [Pro98]. In [Bud05] a novel methodology was proposed that consists of
statistical face segmentation, a physiological feature extraction algorithm, and a procedure
for matching the vascular network extracted from thermal facial imagery.

The downside of employing near infrared and thermal infrared sensors is that glare
reflections and opaque regions appear when subjects wear prescription glasses,
plastic glasses or sunglasses. For a large proportion of individuals the regions around the
eyes, an area of high interest to face recognition systems, become occluded and therefore
less discriminant [Ara06h, Li07].

5.1.1 Multi-sensor based techniques 

In the biometric literature several classifiers have been used to concatenate and consolidate
the match scores of multiple independent matchers of biometric traits [Gha99, BY98, Big97,
Ver99, Wan05b]. In [Bru95a] a HyperBF network is used to combine matchers based on
voice and face features. Ross and Jain [Ros03] use decision tree and linear discriminant
classifiers for classifying the match scores pertaining to the face, fingerprint and hand
geometry modalities. In [Ros05] three different colour channels of a face image are independently
subjected to LDA and then combined.

Recently, several successful attempts have been made to fuse the visual and thermal 
infrared modalities to increase the performance of face recognition [Heo04, Gya04, Soc04, 
Wan04b, Che05, Kon05, Bru95a, Ros03, Ghe03, Heo03a]. Visible and thermal sensors are 
well-matched candidates for image fusion as limitations of imaging in one spectrum seem to 
be precisely the strengths of imaging in the other. Indeed, as the surface of the face and its 
temperature have nothing in common, it would be beneficial to extract and fuse cues from
both sensors that are not redundant and yet complementary.

In [Heo04] two types of visible and thermal fusion techniques have been proposed. The 
first fuses low-level data while the second fuses matching distance scores. Data fusion was 
implemented by applying pixel-based weighted averaging of co-registered visual and thermal 
images. Decision fusion was implemented by combining the matching scores of individual 
recognition modules. 

Fusion at the score level is the most commonly considered approach in the biometric
literature [Ros06]. Cappelli et al. [Cap00] use a double sigmoid function for score
normalization in a multi-biometric system that combines different fingerprint matchers. Once the
match scores output by multiple matchers are transformed into a common domain they can
be combined using simple fusion operators such as the sum of scores, product of scores or
order statistics (e.g., maximum/minimum of scores or median score). Our proposed method
falls into this category of multi-sensor fusion at the score level. To deal with occlusions
caused by eyeglasses in thermal imagery, Heo et al. [Heo04] used a simple ellipse fitting
technique to detect the circle-like eyeglass regions in the IR image and replaced them with
an average eye template. Using a commercial face recognition system, FaceIt [Ide03], they
demonstrated improvements in face recognition accuracy. Our method differs both in the
glasses detection stage, which uses a principled statistical model of appearance variation,
and in the manner in which it handles detected occlusions. Instead of using the average eye
template, which carries no discriminative information, we segment out the eye region from the
infrared data, effectively placing more weight on the discriminative power of the same region
extracted from the filtered, visual imagery.


5.2 Method details 

In the sections that follow we explain our system in detail, the main components of which 
are conceptually depicted in Figure 5.1. 

5.2.1 Matching image sets 

As in the preceding chapters, we deal with face recognition from sets of images, both in
the visual and thermal spectrum. We will show how to achieve illumination invariance
using a combination of simple data preprocessing (Section 5.2.2), a combination of holistic
and local features (Section 5.2.3) and the fusion of two modalities (Section 5.2.4). These
stages normalize for the bulk of appearance changes caused by extrinsic (non person-specific)
factors. Hence, the requirements for our basic set-matching algorithm are those of (i) some
pose generalization and (ii) robustness to noise. We compare two image sets by modelling
the variations within a set using a linear subspace and comparing two subspaces by finding
the most similar modes of variation within them.

Figure 5.1: Our system consists of three main modules performing (i) data preprocessing and
registration, (ii) glasses detection and (iii) fusion of holistic and local face representations
using visual and thermal modalities.

The face appearance modelling step is a simple application of Principal Component
Analysis (PCA) without mean subtraction. In other words, given a data matrix $d$ (each
column representing a rasterized image), the corresponding subspace is spanned by the
eigenvectors of the matrix $C = d d^T$ corresponding to the largest eigenvalues. We used 5D
subspaces, which are sufficiently expressive to explain on average over 90% of the data
variation within a set, face appearance changes being intrinsically low-dimensional.
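
The subspace estimation described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `face_subspace`, the synthetic data matrix and the noise level are hypothetical and not taken from the original system. It relies on the fact that the left singular vectors of $d$ are the eigenvectors of $d d^T$, so the large pixel-by-pixel matrix never has to be formed.

```python
import numpy as np

def face_subspace(d, dim=5):
    """Orthonormal basis of the `dim`-D subspace spanned by the dominant
    eigenvectors of C = d d^T, with NO mean subtraction.
    `d` holds one rasterized image per column."""
    # Left singular vectors of d are eigenvectors of d d^T; the squared
    # singular values are the corresponding eigenvalues.
    u, s, _ = np.linalg.svd(d, full_matrices=False)
    basis = u[:, :dim]
    explained = float(np.sum(s[:dim] ** 2) / np.sum(s ** 2))
    return basis, explained

rng = np.random.default_rng(0)
# Hypothetical stand-in for a face set: 40 "images" of 80*80 pixels lying
# close to a 5-D subspace, plus a little noise.
d = rng.standard_normal((6400, 5)) @ rng.standard_normal((5, 40))
d += 0.01 * rng.standard_normal(d.shape)
basis, explained = face_subspace(d, dim=5)
```

Here `explained` is the fraction of (uncentred) data energy captured by the retained directions, which is the quantity the 90% figure above refers to.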

We next formally introduce the concept of principal angles and motivate their application 
for face image set comparison. We show that they can be used to efficiently extract the most 
similar appearance variation modes within two sets. 

Principal angles 

Principal, or canonical, angles $0 \le \theta_1 \le \ldots \le \theta_D \le \pi/2$ between two $D$-dimensional linear
subspaces $U_1$ and $U_2$ are recursively uniquely defined as the minimal angles between any
two vectors of the subspaces [Hot36]:

$$\rho_i = \cos\theta_i = \max_{u_i \in U_1} \max_{v_i \in U_2} u_i^T v_i \qquad (5.1)$$

subject to the orthonormality condition:

$$u_i^T u_i = v_i^T v_i = 1, \quad u_i^T u_j = v_i^T v_j = 0, \quad j = 1, \ldots, i - 1 \qquad (5.2)$$

We will refer to $u_i$ and $v_i$ as the i-th pair of principal vectors, see Figure 5.2 (a). The
quantity $\rho_i$ is also known as the i-th canonical correlation [Hot36]. Intuitively, the first pair
of principal vectors corresponds to the most similar modes of variation within two linear
subspaces; every next pair to the most similar modes orthogonal to all previous ones. We
quantify the similarity of subspaces $U_1$ and $U_2$, corresponding to two face sets, by the cosine
of the smallest angle between two vectors confined to them i.e. $\rho_1$.

This interpretation of principal vectors motivates the suitability of canonical correlations
as a similarity measure when subspaces $U_1$ and $U_2$ correspond to face images. First, the
empirical observation that face appearance varies smoothly as a function of camera viewpoint
[Ara06b, Bic94] is implicitly exploited: since the computation of the most similar modes of
appearance variation between sets can be seen as an efficient “search” over entire subspaces,
generalization by means of linear pose interpolation and extrapolation is inherently achieved.
This concept is further illustrated in Figure 5.2 (b,c). Furthermore, because the proposed
similarity measure depends on only a single (linear) direction within each subspace, the bulk
of data in each set, deemed not useful in a specific set-to-set comparison,
is thrown away. In this manner robustness to missing data is achieved.

An additional appealing feature of comparing two subspaces in this manner is
its computational efficiency. If $B_1$ and $B_2$ are orthonormal basis matrices corresponding
to $U_1$ and $U_2$, we can write the Singular Value Decomposition (SVD) of the matrix $B_1^T B_2$ as:

$$M = B_1^T B_2 = U \Sigma V^T. \qquad (5.3)$$

The i-th canonical correlation $\rho_i$ is then given by the i-th singular value of $M$ i.e. $\Sigma_{i,i}$, and
the i-th pair of principal vectors $u_i$ and $v_i$ by the i-th columns of, respectively, $B_1 U$ and
$B_2 V$ [Bjö73]. Seeing
that in our case $M$ is a 5 x 5 matrix and that we only use the largest canonical correlation,
$\rho_1$ can be rapidly computed as the square root of the largest eigenvalue of $M M^T$ [Pre92].
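
As a minimal sketch of the SVD route to canonical correlations, the following NumPy snippet computes them for two 2D planes in 3D that share one direction and are tilted by 36 degrees about it. The function name and the toy bases are illustrative assumptions rather than part of the original implementation.

```python
import numpy as np

def canonical_correlations(b1, b2):
    """Canonical correlations (cosines of principal angles) between the
    subspaces spanned by the orthonormal columns of b1 and b2, together
    with the paired principal vectors (as columns)."""
    m = b1.T @ b2
    u, sv, vt = np.linalg.svd(m)
    # Guard against rounding pushing values marginally outside [0, 1].
    rho = np.clip(sv, 0.0, 1.0)
    return rho, b1 @ u, b2 @ vt.T

# Two 2-D subspaces of R^3 necessarily intersect, so rho[0] should be 1.
cs, sn = np.cos(np.pi / 5), np.sin(np.pi / 5)          # 36 degree tilt
b1 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # the x-y plane
b2 = np.array([[1.0, 0.0], [0.0, cs], [0.0, sn]])     # plane tilted about x
rho, pu, pv = canonical_correlations(b1, b2)
angles_deg = np.degrees(np.arccos(rho))
```

Only `rho[0]` is needed for the similarity measure used in this chapter; the remaining values and the principal vectors are returned to make the geometry easier to inspect.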

5.2.2 Data preprocessing & feature extraction

The first stage of our system involves coarse normalization of pose and illumination. Pose 
changes are accounted for by in-plane registration of images, which are then passed through 
quasi illumination-invariant image filters. 

We register all faces, both in the visual and thermal domain, to have the salient facial 
features aligned. Specifically, we align the eyes and the mouth due to the ease of detection 
of these features (e.g. see [Ara05c, Ber04, Cri04, Fel05, Tru05]). The 3 point correspon¬ 
dences, between the detected and the canonical features’ locations, uniquely define an affine 
transformation which is applied to the original image. Faces are then cropped to 80 x 80 
pixels, as shown in Figure 5.3. 
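
To illustrate why three non-collinear correspondences determine the affine map uniquely, the sketch below solves the resulting 3 x 3 linear system directly. The feature coordinates are made-up values chosen for illustration, and the image resampling step itself is left out.

```python
import numpy as np

def affine_from_three_points(src, dst):
    """Solve for the 2x3 affine matrix A such that A @ [x, y, 1]^T maps
    each of the three `src` points onto the corresponding `dst` point."""
    src_h = np.hstack([np.asarray(src, float), np.ones((3, 1))])  # 3x3
    # src_h @ A.T = dst, so A.T is the solution of a 3x3 linear system.
    return np.linalg.solve(src_h, np.asarray(dst, float)).T

# Hypothetical detections (left eye, right eye, mouth) in pixel coordinates.
detected = [(112.0, 95.0), (171.0, 99.0), (139.0, 168.0)]
# Hypothetical canonical locations inside an 80 x 80 crop.
canonical = [(24.0, 28.0), (56.0, 28.0), (40.0, 62.0)]
A = affine_from_three_points(detected, canonical)
mapped = (A @ np.hstack([np.array(detected), np.ones((3, 1))]).T).T
```

The system has six unknowns and six equations, which is why exactly three point pairs suffice; collinear points would make `src_h` singular.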

Coarse brightness normalization is performed by band-pass filtering the images [Ara05c,
Fit02]. The aim is to reduce the amount of high-frequency noise as well as extrinsic
appearance variations confined to a low-frequency band containing little discriminating
information. Most obviously, in visual imagery, the latter are caused by illumination changes,
owing to the smoothness of the surface and albedo of faces [Adi97].

We consider the following type of a band-pass filter:

$$I_F = I * G_{\sigma=W_1} - I * G_{\sigma=W_2}, \qquad (5.4)$$

which has two parameters - the widths $W_1$ and $W_2$ of isotropic Gaussian kernels. These are
estimated from a small training corpus of individuals in different illuminations. Figure 5.4
shows the recognition rate across the corpus as the values of the two parameters are varied.
The optimal values were found to be 2.3 and 6.2 for visual data; the optimal filter for
thermal data was found to be a low-pass filter with $W_2 = 2.8$ (i.e. $W_1$ was found to be very
large). Examples are shown in Figure 5.5. It is important to note from Figure 5.4 that the
recognition rate varied smoothly with changes in kernel widths, showing that the method is
not very sensitive to their exact values, which is suggestive of good generalization to unseen
data.

The result of filtering visual data is further scaled by a smooth version of the original
image:

$$\hat{I}_F(x, y) = I_F(x, y) \,./\, (I * G_{\sigma=W_2}), \qquad (5.5)$$

where ./ represents element-wise division. The purpose of local scaling is to equalize edge
strengths in dark (weak edges) and bright (strong edges) regions of the face; this is similar to
the Self Quotient Image of Wang et al. [Wan04a]. This step further improves the robustness
of the representation to illumination changes, see Section 5.3.
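
A small NumPy-only sketch of the two filters in (5.4) and (5.5) follows. The separable Gaussian implementation, the border handling by edge replication and the `eps` guard are assumptions made for this illustration; the source does not specify them. The random array merely stands in for a registered 80 x 80 face crop.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian smoothing with edge replication at the borders."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    k /= k.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, k, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="valid")

def band_pass(img, w1, w2):
    """Equation (5.4): difference of two Gaussian-smoothed images."""
    return gaussian_blur(img, w1) - gaussian_blur(img, w2)

def self_quotient(img, w1, w2, eps=1e-6):
    """Equation (5.5): band-pass output divided element-wise by the
    smoothed image. `eps` is an added guard against division by zero."""
    return band_pass(img, w1, w2) / (gaussian_blur(img, w2) + eps)

rng = np.random.default_rng(1)
face = rng.uniform(0.2, 1.0, size=(80, 80))   # stand-in for a cropped face
brighter = 3.0 * face                          # same face, globally brighter
q1 = self_quotient(face, 2.3, 6.2)
q2 = self_quotient(brighter, 2.3, 6.2)
flat = band_pass(np.ones((20, 20)), 2.3, 6.2)
```

Because numerator and denominator scale together, `q1` and `q2` agree up to the tiny effect of `eps`, which is the sense in which the locally scaled output is insensitive to a global brightness change.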

5.2.3 Single modality-based recognition 

We compute the similarity of two individuals using only a single modality (visual or thermal) 
by combining the holistic face representation described in Section 5.2.2 and a representation 
based on local image patches. These have been shown to benefit recognition in the presence 
of large pose changes [Siv05]. 

As before, we use the eyes and the mouth as the most discriminative regions, by
extracting rectangular patches centred at the detections, see Figure 5.6. The overall similarity
score is obtained by weighted summation:

$$\rho_{v/t} = \underbrace{\omega_h \cdot \rho_h}_{\text{Holistic contribution}} + \underbrace{\omega_m \cdot \rho_m + (1 - \omega_h - \omega_m) \cdot \rho_e}_{\text{Local features contribution}}, \qquad (5.6)$$

where $\rho_m$, $\rho_e$ and $\rho_h$ are the scores of separately matching, respectively, the mouth, the eyes
and the entire face regions, and $\omega_h$ and $\omega_m$ the weighting constants.

The optimal values of the weights were estimated from the offline training corpus. As
expected, eyes were shown to carry a significant amount of discriminative information, as
for the visual spectrum we obtained $\omega_e = 0.3$ (where $\omega_e = 1 - \omega_h - \omega_m$ is the weight of
the eye score in (5.6)). On the other hand, the mouth region, highly
variable in appearance in the presence of facial expression changes, was found not to improve
recognition (i.e. $\omega_m \approx 0.0$).

The relative magnitudes of the weights were found to be different in the thermal
spectrum, both the eye and the mouth region contributing equally to the overall score:
$\omega_m = 0.1$, $\omega_h = 0.8$. Notice the rather insignificant contribution of individual facial features.
This is most likely due to the inherently spatially slowly varying nature of heat radiated by the
human body.
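
For concreteness, the weighted summation of (5.6) with the weights reported above can be written as follows. The region scores passed in at the end are hypothetical numbers used only to exercise the function; the holistic visual weight of 0.7 is derived from the reported eye and mouth weights rather than stated explicitly in the text.

```python
def single_modality_score(rho_h, rho_m, rho_e, w_h, w_m):
    """Equation (5.6): the eye weight is whatever remains after the
    holistic and mouth weights have been assigned."""
    w_e = 1.0 - w_h - w_m
    if min(w_h, w_m, w_e) < -1e-12:
        raise ValueError("weights must be non-negative and sum to at most 1")
    return w_h * rho_h + w_m * rho_m + w_e * rho_e

# Weights reported in the text: visual w_e = 0.3 and w_m ~ 0 (so w_h = 0.7);
# thermal w_m = 0.1 and w_h = 0.8 (so w_e = 0.1).
VISUAL = dict(w_h=0.7, w_m=0.0)
THERMAL = dict(w_h=0.8, w_m=0.1)

# Hypothetical region scores for one pair of image sets.
rho_visual = single_modality_score(0.90, 0.60, 0.80, **VISUAL)
rho_thermal = single_modality_score(0.85, 0.70, 0.75, **THERMAL)
```

Parameterizing by only two weights keeps the three contributions summing to one, so the fused score stays on the same scale as the individual canonical correlations.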


5.2.4 Fusing modalities 

Until now we have focused on deriving a similarity score between two individuals given sets 
of images in either thermal or visual spectrum. A combination of holistic and local features 
was employed in the computation of both. However, the greatest power of our system comes 
from the fusion of the two modalities. 

Given $\rho_v$ and $\rho_t$, the similarity scores corresponding to visual and thermal data, we
compute the joint similarity as:

$$\rho_f = \underbrace{\omega_v(\rho_v) \cdot \rho_v}_{\text{Optical contribution}} + \underbrace{[1 - \omega_v(\rho_v)] \cdot \rho_t}_{\text{Thermal contribution}}. \qquad (5.7)$$

Notice that the weighting factors are no longer constants, but functions. The key idea is that
if the visual spectrum match is very good (i.e. $\rho_v$ is close to 1.0), we can be confident that
the illumination difference between the two image sets compared is mild and well compensated
for by the visual spectrum preprocessing of Section 5.2.2. In this case, the visual spectrum
should be given relatively more weight than when the match is bad and the illumination
change is likely more drastic. The value of $\omega_v(\rho_v)$ can then be interpreted as the statistically
optimal choice of the mixing coefficient $\omega_v$ given the visual domain similarity $\rho_v$. Formalizing
this we can write


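$$\omega_v(\rho_v) = \arg\max_{\omega} p(\omega \,|\, \rho_v), \qquad (5.8)$$

or, equivalently

$$\omega_v(\rho_v) = \arg\max_{\omega} \frac{p(\omega, \rho_v)}{p(\rho_v)}. \qquad (5.9)$$

Under the assumption of a uniform prior on the degree of visual similarity, $p(\rho_v)$,

$$p(\omega \,|\, \rho_v) \propto p(\omega, \rho_v) \qquad (5.10)$$

and

$$\omega_v(\rho_v) = \arg\max_{\omega} p(\omega, \rho_v). \qquad (5.11)$$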
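
The following sketch shows how (5.11) and (5.7) might be applied once a joint density estimate is available on a regular grid. The grid resolution, the linear interpolation between grid points and the toy density (peaked along $\omega_v = \rho_v$) are all assumptions made purely for illustration.

```python
import numpy as np

def weighting_from_density(density, omega_grid):
    """Equation (5.11): for each rho_v column, pick the omega_v that
    maximizes the joint density. density[i, j] ~ p(omega_grid[i], rho_grid[j])."""
    return omega_grid[np.argmax(density, axis=0)]

def fuse(rho_v, rho_t, rho_grid, omega_of_rho):
    """Equation (5.7), with omega_v looked up (linearly interpolated) at rho_v."""
    w = np.interp(rho_v, rho_grid, omega_of_rho)
    return w * rho_v + (1.0 - w) * rho_t

omega_grid = np.linspace(0.0, 1.0, 21)
rho_grid = np.linspace(0.0, 1.0, 21)
# Toy joint density peaked along omega_v = rho_v (purely illustrative).
density = np.exp(-((omega_grid[:, None] - rho_grid[None, :]) ** 2) / 0.02)
omega_of_rho = weighting_from_density(density, omega_grid)

good_visual = fuse(1.0, 0.2, rho_grid, omega_of_rho)   # visual score trusted
poor_visual = fuse(0.0, 0.4, rho_grid, omega_of_rho)   # thermal score trusted
```

With this toy density a perfect visual match is passed through unchanged, while a very poor one defers entirely to the thermal score, which mirrors the qualitative behaviour argued for above.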


Learning the weighting function 

The function $\omega_v = \omega_v(\rho_v)$ is estimated in three stages: first (i) we estimate $p(\omega_v, \rho_v)$, then
(ii) compute $\omega_v(\rho_v)$ using (5.11) and finally (iii) make an analytic fit to the obtained marginal
distribution. Step (i) is challenging and we describe it next.

Iterative density estimate. The principal difficulty of estimating $p(\omega_v, \rho_v)$ is of a
practical nature: in order to obtain an accurate estimate (i.e. a well-sampled distribution), a
prohibitively large training database is needed. Instead, we employ a heuristic alternative.
Much like before, the estimation is performed using the offline training corpus.

Our algorithm is based on an iterative incremental update of the density, initialized as
uniform over the domain $\omega_v, \rho_v \in [0, 1]$. We iteratively simulate matching of an unknown
person against a set of gallery individuals. In each iteration of the algorithm, these are
randomly drawn from the offline training database. Since the ground truth identities of all
persons in the offline database are known, for each $\omega_v = k\Delta\omega_v$ we can compute (i) the
initial visual spectrum similarity $\rho_v^{p,p}$ of the novel and the corresponding gallery sequences, and
(ii) the resulting separation $\delta(k\Delta\omega_v)$ i.e. the difference between the similarity of the test
set and the set corresponding to it in identity, and that between the test set and the most
similar set that does not correspond to it in identity. This gives us information about the
usefulness of a particular value of $\omega_v$ for the observed $\rho_v^{p,p}$. Hence, the density estimate
$p(\omega_v, \rho_v)$ is then updated at $(k\Delta\omega_v, \rho_v^{p,p})$, $k = 1, \ldots$. We increment it proportionally to
$\delta(k\Delta\omega_v)$ after passing through a y-axis shifted sigmoid function:

$$p(k\Delta\omega_v, \rho_v^{p,p})_{[n+1]} = p(k\Delta\omega_v, \rho_v^{p,p})_{[n]} + [\mathrm{sig}(C \cdot \delta(k\Delta\omega_v)) - 0.5], \qquad (5.12)$$

where subscript $[n]$ signifies the n-th iteration step and

$$\mathrm{sig}(x) = \frac{1}{1 + e^{-x}}, \qquad (5.13)$$

as shown in Figure 5.7 (a). The sigmoid function has the effect of reducing the overly
confident weight updates for the values of $\omega_v$ that result in extremely good or bad separations
$\delta(k\Delta\omega_v)$. The purpose of this can be seen by noting that we are using separation as a
proxy for the statistical goodness of $\omega_v$, while in fact attempting to maximize the average
recognition rate (i.e. the average number of cases for which $\delta(k\Delta\omega_v) > 0$).

Figure 5.8 summarizes the proposed offline learning algorithm. An analytic fit to $\omega_v(\rho_v)$
in the form $(1 + e^{a})/(1 + e^{a/\rho_v})$ is shown in Figure 5.7 (b).
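
The core of the offline learning loop can be sketched as below. This is a simplified illustration, not the original implementation: the array layout of the similarity scores, the grid sizes, the constant `C`, the nearest-bin assignment of the observed visual similarity and the final clipping of negative entries are all assumptions, and the Gaussian smoothing step of the full algorithm is omitted.

```python
import numpy as np

def sig(x):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def learn_density(rho_v, rho_t, n_omega=11, n_rho=11, C=10.0):
    """rho_v, rho_t: arrays of shape (n_trials, n_persons, n_persons);
    entry [t, p, q] is the similarity between the test set of person p and
    the gallery set of person q in simulated trial t."""
    omegas = np.linspace(0.0, 1.0, n_omega)
    density = np.zeros((n_omega, n_rho))        # constant initialization
    n_trials, n_persons, _ = rho_v.shape
    for t in range(n_trials):
        for p in range(n_persons):
            others = np.arange(n_persons) != p
            # Grid bin of the observed visual similarity to the true identity.
            j = int(round(float(np.clip(rho_v[t, p, p], 0.0, 1.0)) * (n_rho - 1)))
            for k, w in enumerate(omegas):
                fused = w * rho_v[t, p] + (1.0 - w) * rho_t[t, p]
                delta = fused[p] - fused[others].max()   # separation
                density[k, j] += sig(C * delta) - 0.5    # shifted sigmoid
    density = np.clip(density, 0.0, None)   # added guard, not in the source
    return omegas, density / density.sum()  # assumes a non-zero total

# Toy trial with two people: visual scores are misleading, thermal are not.
rho_v = np.array([[[0.5, 0.6], [0.6, 0.5]]])
rho_t = np.array([[[0.9, 0.3], [0.3, 0.9]]])
omegas, density = learn_density(rho_v, rho_t)
best_omega = omegas[np.argmax(density[:, 5])]
```

In this toy case the visual similarity to the correct identity is 0.5 and the visual modality ranks the impostor higher, so the learnt density in that column favours giving the visual score no weight.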

5.2.5 Prescription glasses 

The appeal of using the thermal spectrum for face recognition stems mainly from its
invariance to illumination changes, in sharp contrast to visual spectrum data. The exact opposite
is true in the case of prescription glasses, which appear as dark patches in thermal imagery,
see Figure 5.5. The practical importance of this can be seen by noting that in the US in
2000 roughly 96 million people, or 34% of the total population, wore prescription glasses
[Wal01].

In our system, the otherwise undesired, gross appearance distortion that glasses cause 
in thermal imagery is used to help recognition by detecting their presence. If the subject 
is not wearing glasses, then both holistic and all local patches-based face representations 
can be used in recognition; otherwise the eye regions in thermal images are ignored as they 
contain no useful recognition (discriminative) information. 

Glasses detection. We detect the presence of glasses by building representations for the
left eye region (due to the symmetry of faces, a detector for only one side is needed) with and
without glasses, in the thermal spectrum. The foundations of our classifier are laid out in
§5.2.1. Appearance variations of the eye region with or without glasses are represented by
two 6D linear subspaces estimated from the training data corpus, see Fig. 5.9 for examples
of training data used for subspace estimation. The linear subspace corresponding to eye
region patches extracted from a set of thermal imagery of a novel person is then compared
with the “glasses on” and “glasses off” subspaces using principal angles. The presence of glasses
is deduced when the “glasses on” subspace results in the higher similarity score. We obtain
close to flawless performance on our data set (also see §5.3 for description), as shown in
Fig. 5.10 (a,b). The good discriminative ability of principal angles in this case is also supported
by visual inspection of the “glasses on” and “glasses off” subspaces; this is illustrated in
Fig. 5.10 (c) which shows the first two dominant modes of each, embedded in the 3D principal
subspace.

The presence of glasses severely limits what can be achieved with thermal imagery, the
occlusion heavily affecting both the holistic face appearance as well as that of the eye regions.
This is the point at which our method heavily relies on decision fusion with visual data,
limiting the contribution of the thermal spectrum to matching using mouth appearance only
i.e. setting $\omega_m = 1.0$ in (5.6).


5.3 Empirical evaluation 

We evaluated the described system on the “Dataset 02: IRIS Thermal/Visible Face Database”
subset of the Object Tracking and Classification Beyond the Visible Spectrum (OTCBVS)
database^1, freely available for download at http://www.cse.ohio-state.edu/OTCBVS-BENCH/.
Briefly, this database contains 29 individuals, 11 roughly matching poses in visual and
thermal spectra and large illumination variations (some of these are exemplified in Figure 5.11).
Images were acquired using the Raytheon Palm-IR-Pro camera in the thermal and Panasonic
WV-CP234 camera in the visual spectrum, in the resolution of 240 x 320 pixels.

Our algorithm was trained using all images in a single illumination in which all 3 salient 
facial features could be detected. This typically resulted in 7-8 images in the visual and 6-7 
in the thermal spectrum, see Figure 5.12, and roughly ±45° yaw range, as measured from 
the frontal face orientation. 

The performance of the algorithm was evaluated both in 1-to-N and 1-to-1 matching
scenarios. In the former case, we assumed that test data corresponded to one of the people in
the training set and recognition was performed by associating it with the closest match.
Verification (or 1-to-1 matching, “is this the same person?”) performance was quantified by
looking at the true positive admittance rate for a threshold that corresponds to 1 admitted
intruder in 100.
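
The verification measure described above can be computed along the following lines. The tie-handling convention (scores strictly above the threshold are admitted) and the synthetic score lists are assumptions introduced for this sketch.

```python
import numpy as np

def true_admittance_at_far(genuine, impostor, far=0.01):
    """Fraction of genuine scores admitted at the threshold that lets
    through at most a fraction `far` of the impostor scores."""
    impostor = np.sort(np.asarray(impostor, float))[::-1]    # descending
    n_allowed = int(np.floor(far * impostor.size))            # admitted intruders
    # Scores strictly above the threshold are admitted.
    threshold = impostor[n_allowed]
    tar = float(np.mean(np.asarray(genuine, float) > threshold))
    return tar, float(threshold)

# Synthetic similarity scores: 100 impostor comparisons, 4 genuine ones.
impostor_scores = np.arange(100) / 100.0      # 0.00, 0.01, ..., 0.99
genuine_scores = [0.995, 0.985, 0.5, 0.999]
tar, thr = true_admittance_at_far(genuine_scores, impostor_scores)
admitted_intruders = int(np.sum(impostor_scores > thr))
```

With 100 impostor comparisons exactly one is admitted at the chosen threshold, and three of the four genuine comparisons exceed it.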

5.3.1 Results 

A summary of 1-to-N matching results is shown in Table 5.1.

Firstly, note the poor performance achieved using both raw visual as well as raw thermal
data. The former is suggestive of challenging illumination changes present in the OTCBVS
data set. This is further confirmed by significant improvements gained with both band-pass
filtering and the Self-Quotient Image which increased the average recognition rate by,
respectively, 35% and 47%. The same is corroborated by the Receiver-Operator Characteristic
curves in Figure 5.14 and 1-to-1 matching results in Table 5.2.

^1 IEEE OTCBVS WS Series Bench; DOE University Research Program in Robotics under grant DOE-
DE-FG02-86NE37968; DOD/TACOM/NAC/ARC Program under grant R01-1344-18; FAA/NSSA grant
R01-1344-48/49; Office of Naval Research under grant #N000143010022.

On the other hand, the reason for the low recognition rate of raw thermal imagery is twofold:
it was previously argued that the two main limitations of this modality are the inherently
lower discriminative power and occlusions caused by prescription glasses. The addition
of the glasses detection module is of little help at this point - some benefit is gained by
steering away from misleadingly good matches between any two people wearing glasses,
but it is limited in extent as a very discriminative region of the face is lost. Furthermore,
the improvement achieved by optimal band-pass filtering in thermal imagery is much more
modest than with visual data, increasing performance by 8% compared with 35%. A similar
increase was obtained in true admittance rate (42% vs. 8%), see Table 5.2.

Neither the eyes nor the mouth regions, in either the visual or thermal spectrum, proved
very discriminative when used in isolation, see Figure 5.13. Only 10-12% true positive
admittance was achieved, as shown in Table 5.3. However, the proposed fusion of holistic
and local appearance offered a consistent and statistically significant improvement. In 1-to-1
matching the true positive admittance rate increased by 4-6%, while the average correct
1-to-N matching improved by roughly 2-3%.

The greatest power of the method becomes apparent when the two modalities, visual
and thermal, are fused. In this case the role of the glasses detection module is much
more prominent, drastically decreasing the average error rate from 10% down to 3%, see
Table 5.1. Similarly, the true admission rate increases to 74% when data is fused without
special handling of glasses, and to 80% when glasses are taken into account.


5.4 Summary and conclusions 

In this chapter we described a system for personal identification based on a face biometric
that uses cues from visual and thermal imagery. The two modalities are shown to
complement each other, their fusion providing good illumination invariance and discriminative
power between individuals. Prescription glasses, a major difficulty in the thermal spectrum,
are reliably detected by our method, restricting the matching to non-affected face regions.
Finally, we examined how different preprocessing methods affect recognition in the two
spectra, as well as holistic and local feature-based face representations. The proposed method
was shown to achieve a high recognition rate (97%) using only a small number of training
images (5-7) in the presence of large illumination changes.

Table 5.1: Shown is the average rank-1 recognition rate using different representations across
all combinations of illuminations. Note the performance increase with each of the main
features of our system: image filtering, combination of holistic and local features, modality
fusion and prescription glasses detection.

Modality   Representation                                        Recognition
Visual     Holistic raw data                                     0.58
           Holistic, band-pass                                   0.78
           Holistic, SQI filtered                                0.85
           Mouth+eyes+holistic data fusion, SQI filtered         0.87
Thermal    Holistic raw data                                     0.74
           Holistic raw w/ glasses detection                     0.77
           Holistic, low-pass filtered                           0.80
           Mouth+eyes+holistic data fusion, low-pass filtered    0.82
Proposed thermal + visual fusion
           w/o glasses detection                                 0.90
           w/ glasses detection                                  0.97

Table 5.2: A summary of the comparison of different image processing filters for 1 in 100
intruder acceptance rate. Both the simple band-pass filter, and even further its locally-scaled
variant, greatly improve performance. This is most significant in the visual spectrum, in
which image intensity in the low spatial frequency is most affected by illumination changes.

Representation               Visual    Thermal
                             (1% intruder acceptance)
Unprocessed/raw              0.2850    0.5803
Band-pass filtered (BP)      0.4933    0.6287
Self-quotient image (SQI)    0.6410    0.6301

Table 5.3: A summary of the results for 1 in 100 intruder acceptance rate. Local features in
isolation perform very poorly.

Representation    Visual (SQI)    Thermal (BP)
                  (1% intruder acceptance)
Eyes              0.1016          0.2984
Mouth             0.1223          0.3037
Table 5.4: Holistic & local features - a summary of 1-to-1 matching (verification) results.

Representation             Visual (SQI)    Thermal (BP)
                           (1% intruder acceptance)
Holistic + Eyes            0.6782          0.6499
Holistic + Mouth           0.6410          0.6501
Holistic + Eyes + Mouth    0.6782          0.6558

Table 5.5: Feature and modality fusion - a summary of the 1-to-1 matching (verification)
results.

Representation               True admission rate
                             (1% intruder acceptance)
Without glasses detection    0.7435
With glasses detection       0.8014

Figure 5.2: An illustration of the concept of principal angles and principal vectors in the case
of two 2D subspaces embedded in a 3D space. As two such subspaces necessarily intersect,
the first pair of principal vectors is the same (i.e. $u_1 = v_1$). However, the second pair is not,
and in this case forms the second principal angle of $\cos^{-1}\rho_2 = \cos^{-1}(0.8084) \approx 36°$. The top
three pairs of principal vectors, displayed as images, when the subspaces correspond to image
sets of the same and different individuals are displayed in (b) and (c) (top rows correspond
to $u_i$, bottom rows to $v_i$). In (b), the most similar modes of pattern variation, represented
by principal vectors, are very much alike in spite of different illumination conditions used
in data acquisition.

Figure 5.3: Shown is the original image in the visual spectrum with detected facial features
marked by yellow circles (left), the result of affine warping the image to the canonical frame
(centre) and the final registered and cropped facial image (right).



(a) Visual (b) Thermal


Figure 5.4: The optimal combination of the lower and upper band-pass filter thresholds is
estimated from a small training corpus. The plots show the recognition rate using a single
modality, (a) visual and (b) thermal, as a function of the widths $W_1$ and $W_2$ of the two
Gaussian kernels in (5.4). It is interesting to note that the optimal band-pass filter for the
visual spectrum passes a rather narrow, mid-frequency band, whereas the optimal filter for
the thermal spectrum is in fact a low-pass filter.



(a) Visual (b) Thermal 


Figure 5.5: The effects of the optimal band-pass filters on registered and cropped faces in (a) 
visual and (b) thermal spectra. 
Figure 5.6: In both the visual and the thermal spectrum our algorithm combines the
similarities obtained by matching the holistic face appearance and the appearance of three
salient local features - the eyes and the mouth.



(a) y-axis shifted sigmoid function (b) Weighting function

Figure 5.7: The contribution of visual matching, as a function of the similarity of visual
imagery. A low similarity score between image sets in the visual domain is indicative of
large illumination changes and consequently our algorithm learnt that more weight should be
placed on the illumination-invariant thermal spectrum.
Input:  visual data $d_v$(person, illumination),
        thermal data $d_t$(person, illumination).
Output: density estimate $p(\omega_v, \rho_v)$.


1: Initialization
   $p(\omega_v, \rho_v) = 0$,

2: Iteration
   for all illuminations $i$, $j$ and persons $p$

3: Iteration
   for all $k = 0, \ldots, 1/\Delta\omega_v$, $\omega_v = k\Delta\omega_v$

5: Separation given $\omega$
   $\delta(k\Delta\omega_v) = \min_{q \neq p} \big[ \omega_v \rho_v^{p,p} + (1 - \omega_v)\rho_t^{p,p}
                               - \big( \omega_v \rho_v^{p,q} + (1 - \omega_v)\rho_t^{p,q} \big) \big]$

6: Update density estimate
   $p(k\Delta\omega_v, \rho_v^{p,p}) = p(k\Delta\omega_v, \rho_v^{p,p})
                                       + \mathrm{sig}(C \cdot \delta(k\Delta\omega_v)) - 0.5$

7: Smooth the output
   $p(\omega_v, \rho_v) = p(\omega_v, \rho_v) * G_{\sigma=0.05}$

8: Normalize to unit integral
   $p(\omega_v, \rho_v) = p(\omega_v, \rho_v) / \int\!\!\int p(\omega_v, \rho_v)\, d\rho_v\, d\omega_v$


Figure 5.8: The proposed fusion learning algorithm, used offline.



Figure 5.9: Shown are examples of glasses-on (top) and glasses-off (bottom) thermal data 
used to construct the corresponding appearance models for our glasses detector. 
(c) Model subspaces


Figure 5.10: (a) Inter- and (b) intra- class (glasses on and off) similarities across our 
data set. (c) Good discrimination by principal angles is also motivated qualitatively as the 
subspaces modelling appearance variations of the eye region with and without glasses on show 
very different orientations even when projected to the 3D principal subspace. As expected, 
the “glasses off” subspace describes more appearance variation, as illustrated by the larger 
extent of the linear patch representing it in the plot. 
(b) Thermal


Figure 5.11: Each row corresponds to an example of a single training (or test) set of images 
used for our algorithm in the (a) visual and (b) thermal spectrum. Note the extreme changes 
in illumination, as well as that in some sets the user is wearing glasses and in some not. 



(a) Visual (b) Thermal 


Figure 5.12: Shown are histograms of the number of images per person used to train our 
algorithm. Depending on the exact head poses assumed by the user we typically obtained 1-8 
visual spectrum images and typically a slightly lower number for the thermal spectrum. The 
range of yaw angles covered is roughly ±45° measured from the frontal face orientation. 
(a) Eyes (b) Mouth


Figure 5.13: Isolated local features Receiver-Operator Characteristics (ROC): for visual 
(blue) and thermal (red) spectra. 



(c) Self-Quotient Image filtered 


Figure 5.14: Holistic representations Receiver-Operator Characteristics (ROC) for visual 
(blue) and thermal (red) spectra. 
Illumination Invariance using Image Filters


In the previous chapter, invariance to illumination conditions was achieved by fusing face
biometrics acquired in the visual and thermal spectra. While successful, in practice this
approach suffers from the limited availability and high cost of thermal imagers. We wish to
achieve the same using visual data only, acquired with an inexpensive and readily available
optical camera.

In this chapter we show that image processed visual data can be used to much the 
same effect as we used thermal data, fusing it with raw visual data. The framework is 
based on simple image processing filters that compete with unprocessed greyscale input to 
yield a single matching score between individuals. It is shown how the discrepancy in
illumination conditions between novel input and the training data set can be estimated and
used to weigh the contribution of two competing representations. Evaluated on CamFace,
ToshFace and Face Video databases, our algorithm consistently demonstrated a dramatic 
performance improvement over traditional filtering approaches. We demonstrate a reduction 
of 50-75% in recognition error rates, the best performing method-filter combination correctly 
recognizing 96% of the individuals. 


6.1 Adapting to data acquisition conditions 

The framework proposed in this chapter is most closely motivated by the findings first
reported in [Ara06b]. In that paper several face recognition algorithms were evaluated on
a large database using (i) raw greyscale input, (ii) a high-pass (HP) filter and (iii) the
Self-Quotient Image (QI) [Wan04a]. Both the high-pass and, to an even greater extent, the
Self-Quotient Image representations produced an improvement in recognition for all methods
over raw greyscale, which is consistent with previous findings in the literature [Adi97,
Ara05c, Fit02, Wan04a]. Of importance to this work is that it was also examined in which
cases these filters help, and by how much, depending on the data acquisition conditions. It
was found, consistently over different algorithms, that recognition rates using greyscale and
either the HP or the QI filter were negatively correlated (with $\rho \approx -0.7$), as
illustrated in Figure 6.1.

This is an interesting result: it means that while on average both representations increase
the recognition rate, they actually worsen it in “easy” recognition conditions when
no normalization is needed. The observed phenomenon is well understood in the context
of energy of intrinsic and extrinsic image differences and noise (see [WanOSa] for a thorough
discussion). Higher than average recognition rates for raw input correspond to small
changes in imaging conditions between training and test, and hence lower energy of extrinsic
variation. In this case, the two filters decrease the signal-to-noise ratio, worsening the
performance. On the other hand, when the imaging conditions between training and test are
very different, normalization of extrinsic variation is the dominant factor and performance
is improved, see Figure 6.2 (b).


Figure 6.1: A plot of the performance improvement with HP and QI filters against the
performance of unprocessed, raw imagery across different illumination combinations used in
training and test. The tests are shown in the order of increasing raw data performance for
easier visualization.



(a) Similar acquisition conditions between sequences 



(b) Different acquisition conditions between sequences 


Figure 6.2: A conceptual illustration of the distributions of intrinsic, extrinsic and noise
signal energies across frequencies in the cases when training and test data acquisition
conditions are (a) similar and (b) different, before (left) and after (right) band-pass filtering.
Figure 6.3: Distances (0-1) between sets of faces - interpersonal and intrapersonal
comparisons are shown respectively as large red and small blue dots. Individuals are poorly
separated.


This is an important observation: it suggests that the performance of a method that 
uses either of the representations can be increased further by detecting the difficulty of 
recognition conditions. In this chapter we propose a novel learning framework to do exactly 
this. 


6.1.1 Adaptive framework 

Our goal is to implicitly learn how similar the novel and training (or gallery) illumination
conditions are, so as to appropriately emphasize face comparisons guided by either the raw
input or its filtered output. Figure 6.3 shows the difficulty of this task: different classes
(i.e. persons) are not well separated in the space of 2D feature vectors obtained by stacking
raw and filtered similarity scores.

Let $\{X_1, \ldots, X_N\}$ be a database of known individuals, $X$ novel input corresponding to
one of the gallery classes and $\rho()$ and $F()$, respectively, a given similarity function and a
quasi illumination-invariant filter. We then express the degree of belief $\eta$ that two face sets
$X$ and $X_i$ belong to the same person as a weighted combination of similarities between the
corresponding unprocessed and filtered image sets:

    $\eta = (1 - \alpha^*)\,\rho(X, X_i) + \alpha^*\,\rho(F(X), F(X_i))$    (6.1)

In the light of the previous discussion, we want $\alpha^*$ to be small (closer to 0.0) when novel
and the corresponding gallery data have been acquired in similar illuminations, and large
(closer to 1.0) when in very different ones. We show that $\alpha^*$ can be learnt as a function:

    $\alpha^* = \alpha^*(\mu)$,    (6.2)
where $\mu$ is the confusion margin: the difference between the similarities of the two $X_i$
most similar to $X$. As in Chapter 5, we compute an estimate of $\alpha^*(\mu)$ in a maximum a
posteriori sense:

    $\alpha^*(\mu) = \arg\max_{\alpha}\, p(\alpha \mid \mu)$,    (6.3)

which, under the assumption of a uniform prior on the confusion margin $\mu$, reduces to:

    $\alpha^*(\mu) = \arg\max_{\alpha}\, p(\alpha, \mu)$,    (6.4)

where $p(\alpha, \mu)$ is the probability that $\alpha$ is the optimal value of the mixing coefficient.
The proposed offline learning algorithm is entirely analogous to the algorithm described in
Section 5.2.4, so here we just summarize it in Figure 6.4, with a typical evolution of
$p(\alpha, \mu)$ shown in Figure 6.5. The final stage of the offline learning in our method involves
imposing the monotonicity constraint on $\alpha^*(\mu)$ and smoothing of the result, see Figure 6.6.
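
As a rough illustration of how (6.1) and (6.2) could be applied at recognition time, the
following Python sketch computes the confusion margin, looks up $\alpha^*(\mu)$ and fuses the two
similarity scores. The function names, the tabulated form of the learnt curve (`mu_grid`,
`alpha_grid`), the linear interpolation and the choice of computing the margin from the raw
similarities are assumptions made for illustration only; they are not details of the
implementation evaluated in this chapter.

```python
from bisect import bisect_right


def confusion_margin(similarities):
    """Difference between the two largest similarities to the gallery sets."""
    top, second = sorted(similarities, reverse=True)[:2]
    return top - second


def alpha_star(mu, mu_grid, alpha_grid):
    """Piecewise-linear lookup of a tabulated alpha*(mu) curve (assumed representation)."""
    if mu <= mu_grid[0]:
        return alpha_grid[0]
    if mu >= mu_grid[-1]:
        return alpha_grid[-1]
    k = bisect_right(mu_grid, mu)          # mu_grid[k - 1] <= mu < mu_grid[k]
    t = (mu - mu_grid[k - 1]) / (mu_grid[k] - mu_grid[k - 1])
    return (1 - t) * alpha_grid[k - 1] + t * alpha_grid[k]


def fused_scores(raw_sims, filt_sims, mu_grid, alpha_grid):
    """Equation (6.1): eta_i = (1 - a) * rho(X, X_i) + a * rho(F(X), F(X_i))."""
    mu = confusion_margin(raw_sims)        # assumption: margin taken from raw similarities
    a = alpha_star(mu, mu_grid, alpha_grid)
    return [(1 - a) * r + a * f for r, f in zip(raw_sims, filt_sims)]
```

The novel input would then be assigned to the gallery individual with the largest fused score.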
Input:  training data $D$(person, illumination),
        filtered data $F$(person, illumination),
        similarity function $\rho$,
        filter $F$.

Output: estimate $p(\alpha, \mu)$.


1: Init
   $p(\alpha, \mu) = 0$,

2: Iteration
   for all illuminations $i$, $j$ and persons $p$

3: Initial separation
   $\delta_0 = \min_{q \neq p} [\rho(D(p,i), D(q,j)) - \rho(D(p,i), D(p,j))]$

4: Iteration
   for all $k = 0, \ldots, 1/\Delta\alpha$, $\alpha = k\Delta\alpha$

5: Separation given $\alpha$
   $\delta(k\Delta\alpha) = \min_{q \neq p} [\alpha\rho(F(p,i), F(q,j)) - \alpha\rho(F(p,i), F(p,j))
                              + (1 - \alpha)\rho(D(p,i), D(q,j)) - (1 - \alpha)\rho(D(p,i), D(p,j))]$

6: Update density estimate
   $p(k\Delta\alpha, \delta_0) = p(k\Delta\alpha, \delta_0) + \delta(k\Delta\alpha)$

7: Smooth the output
   $p(\alpha, \mu) = p(\alpha, \mu) * G_{\sigma=0.05}$

8: Normalize to unit integral
   $p(\alpha, \mu) = p(\alpha, \mu) / \int_\alpha \int_x p(\alpha, x)\, dx\, d\alpha$


Figure 6.4: Offline training algorithm.
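
The steps of Figure 6.4 can be sketched in Python with NumPy as follows. This is a minimal
illustration under stated assumptions rather than the implementation used for the reported
experiments: the precomputed arrays `rho_raw[p, i, q, j]` and `rho_filt[p, i, q, j]`, the grid
sizes, the range of the margin axis, the skipping of $i = j$ and the running-maximum way of
imposing monotonicity on $\alpha^*(\mu)$ are all hypothetical choices.

```python
import numpy as np


def gaussian_kernel(sigma_bins):
    """1D Gaussian kernel, truncated at about 3 sigma and normalized to unit sum."""
    radius = max(1, int(round(3 * sigma_bins)))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma_bins) ** 2)
    return k / k.sum()


def learn_alpha_density(rho_raw, rho_filt, n_alpha=21, n_mu=41, sigma=0.05):
    """rho_raw[p, i, q, j] holds rho(D(p, i), D(q, j)); rho_filt holds rho(F(p, i), F(q, j))."""
    P, I = rho_raw.shape[:2]
    alphas = np.linspace(0.0, 1.0, n_alpha)
    mus = np.linspace(-1.0, 1.0, n_mu)                # assumed range of the margin axis
    density = np.zeros((n_alpha, n_mu))               # step 1
    for p in range(P):                                # step 2
        others = np.arange(P) != p
        for i in range(I):
            for j in range(I):
                if i == j:                            # assumption: skip identical conditions
                    continue
                d_raw = rho_raw[p, i, others, j] - rho_raw[p, i, p, j]
                d_filt = rho_filt[p, i, others, j] - rho_filt[p, i, p, j]
                delta0 = d_raw.min()                  # step 3
                m = int(np.argmin(np.abs(mus - delta0)))
                for k, a in enumerate(alphas):        # step 4
                    delta = (a * d_filt + (1 - a) * d_raw).min()   # step 5
                    density[k, m] += delta            # step 6
    # step 7: separable Gaussian smoothing, sigma expressed in grid bins
    ka = gaussian_kernel(sigma / (alphas[1] - alphas[0]))
    km = gaussian_kernel(sigma / (mus[1] - mus[0]))
    density = np.apply_along_axis(np.convolve, 0, density, ka, mode="same")
    density = np.apply_along_axis(np.convolve, 1, density, km, mode="same")
    # step 8: normalize to unit integral over the grid
    total = density.sum() * (alphas[1] - alphas[0]) * (mus[1] - mus[0])
    if total != 0:
        density = density / total
    return alphas, mus, density


def alpha_curve(alphas, density):
    """alpha*(mu) as the per-margin argmax, made non-decreasing by a running maximum."""
    raw = alphas[np.argmax(density, axis=0)]
    return np.maximum.accumulate(raw)
```

The smoothing kernels must be shorter than the corresponding grid axes, which holds for the
default grid sizes above.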





(e) Iteration 400 (f) Iteration 500 


Figure 6.5: The estimate of the joint density $p(\alpha, \mu)$ through 500 iterations for a
band-pass filter used for the evaluation of the proposed framework in Section 6.2.1.
(a) Raw $\alpha^*(\mu)$ estimate (b) Monotonic $\alpha^*(\mu)$ estimate



(c) Final smooth and monotonic (d) Alpha density map $p(\alpha, \mu)$


Figure 6.6: Typical estimates of the $\alpha$-function plotted against confusion margin $\mu$ (a-c).
The estimate shown was computed using 40 individuals in 5 illumination conditions for a
Gaussian high-pass filter. As expected, $\alpha^*$ assumes low values for small confusion margins
and high values for large confusion margins (see (7.8)). Learnt probability density
$p(\alpha, \mu)$ (greyscale surface) and a superimposed raw estimate of the $\alpha$-function
(solid red line) for a high-pass filter are shown in (d).


6.2 Empirical evaluation 

The proposed framework was evaluated using the following filters (illustrated in Figure 6.7):

• Gaussian high-pass filtered images [Ara05c, Fit02] (HP):

    $X_H = X - (X * G_{\sigma=1.5})$,    (6.5)

• local intensity-normalized high-pass filtered images - similar to the Self-Quotient
Image [Wan04a] (QI):

    $X_Q = X_H \,./\, X_L = X_H \,./\, (X - X_H)$,    (6.6)

the division being element-wise,

• distance-transformed edge map [Ara06b, Can86] (ED):

    $X_{ED} = \mathrm{DistanceTransform}[X_E]$    (6.7)
    $\phantom{X_{ED}} = \mathrm{DistanceTransform}[\mathrm{Canny}(X)]$,    (6.8)

• Laplacian-of-Gaussian [Adi97] (LG):

    $X_L = X * \nabla G_{\sigma=3}$,    (6.9)

where $*$ denotes convolution, and

• directional grey-scale derivatives [Adi97, Eve04] (DX, DY):

    $X_x = X * \frac{\partial}{\partial x} G_{\sigma_x}$    (6.10)
    $X_y = X * \frac{\partial}{\partial y} G_{\sigma_y}$.    (6.11)


Figure 6.7: An example of the original image and the 6 corresponding filtered representations
we evaluated.

To demonstrate the contribution of the proposed framework, we evaluated it with two 
well-established methods in the literature: 

• Constrained MSM (CMSM) [FukOS] used in a state-of-the-art commercial system 
FacePass® [Tos06], and 

• Mutual Subspace Method (MSM) [FukOS]. 

In all tests, both training data for each person in the gallery, as well as test data, consisted
of only a single sequence. Offline training of the proposed algorithm was performed using
40 individuals in 5 illuminations from the CamFace data set. We emphasize that these were
not used as test input for the evaluations reported in this section.

6.2.1 Results 

We evaluated the performance of CMSM and MSM using each of the 7 face image
representations (raw input and 6 filter outputs). Recognition results for the 3 databases are
shown in blue in Figure 6.9 (the results on the Face Video data set are tabulated in
Figure 6.9 (c), for the ease of visualization). Confirming the first premise of this work as
well as previous research findings, all of the filters produced an improvement in average
recognition rates. Little interaction between method/filter combinations was found, with
Laplacian-of-Gaussian and the horizontal intensity derivative producing the best results and
bringing the best and average recognition errors down to 12% and 9% respectively.

In the last set of experiments, we employed each of the 6 filters in the proposed
data-adaptive framework. Recognition results for the 3 databases are shown in red in Figure 6.9.
The proposed method produced a dramatic performance improvement in the case of all
filters, reducing the average recognition error rate to only 4% in the case of the
CMSM/Laplacian-of-Gaussian combination. An improvement in the robustness to illumination
changes can also be seen in the significantly reduced standard deviation of the recognition.
Finally, it should be emphasized that the demonstrated improvement is obtained with a
negligible increase in the computational cost as all time-demanding learning is performed offline.

6.2.2 Failure modes 

In the discussion of failure modes of the described framework, it is necessary to distinguish
between errors introduced by the particular image processing filter used, and those introduced
by the fusion algorithm itself. As generally recognized across the literature (e.g. see
[Adi97]), qualitative inspection of incorrect recognitions using filtered representations
indicates that the main difficulties are posed by those illumination effects which most
significantly deviate from the underlying frequency model (see Section 2.3.2) such as: cast
shadows, specularities (especially commonly observed for users with glasses) and photo-sensor
saturation.

On the other hand, any failure modes of our fusion framework were difficult to clearly
identify, due to such a low frequency of erroneous recognition decisions. Even these were
in virtually all of the cases due to overly confident decisions in the filtered pipeline.
Overall, this makes the methodology proposed in this chapter extremely promising as a robust
and efficient way of matching face appearance image sets, and suggests that future work
should concentrate on developing appropriately robust image filters that can deal with more
complex illumination effects.





(c) 


Figure 6.8: (a) The first pair of principal vectors (top and bottom) corresponding to the 
sequences (b) and (c) (every detection is shown for compactness), for each of the 7 
representations used in the empirical evaluation described in this chapter. A higher degree 
of similarity between the two vectors indicates a greater degree of illumination invariance of 
the corresponding filter. 
6.3 Summary and conclusions

In this chapter we described a novel framework for increasing the robustness of simple image
filters for automatic face recognition in the presence of varying illumination. The proposed
framework is general and is applicable to matching face sets or sequences, as well as single
shots. It is based on simple image processing filters that compete with unprocessed greyscale
input to yield a single matching score between individuals. By performing all numerically
demanding computation offline, our method (i) retains the matching efficiency of simple
image filters, as all online processing is performed in closed form, while (ii) greatly
increasing their robustness. Evaluated on a large, real-world data corpus, the proposed
method was shown to dramatically improve video-based recognition across a wide range of
illumination, pose and face motion pattern changes.


Related publications 

The following publications resulted from the work presented in this chapter: 

• O. Arandjelovic and R. Cipolla. A new look at filtering techniques for illumination
invariance in automatic face recognition. In Proc. IEEE Conference on Automatic
Face and Gesture Recognition (FGR), pages 449-454, April 2006. [Ara06f]

• G. Brostow, M. Johnson, J. Shotton, O. Arandjelovic, V. Kwatra and R. Cipolla. 
Semantic photo synthesis. In Proc. Eurographics, September 2006. [Bro06] 
(a) CamFace



(b) ToshFace



             RW      HP      QI      ED      LG      DX      DY
MSM          0.00    0.00    0.00    0.00    9.09    0.00    0.00
MSM-AD       0.00    0.00    0.00    0.00    0.00    0.00    0.00
CMSM         0.00    9.09    0.00    0.00    0.00    0.00    0.00
CMSM-AD      0.00    0.00    0.00    0.00    0.00    0.00    0.00

(c) Face Video Database, mean error (%)


Figure 6.9: Error rate statistics. The proposed framework (-AD suffix) dramatically improved 
recognition performance on all method/filter combinations, as witnessed by the reduction in 
both error rate averages and their standard deviations. 
Joseph M. W. Turner. Snow Storm: Steamboat off a Harbour’s Mouth
1842, Oil on Canvas, 91.4 x 121.9 cm

The method introduced in the previous chapter suffers from two major drawbacks.
Firstly, the image formation model implicit in the derivation of the employed
quasi-illumination-invariant image filters is too simplistic. Secondly, illumination
normalization is performed on a frame-by-frame basis, not fully exploiting all the available
data from a head motion sequence.

In this chapter we focus on the latter problem. We return to considering face appearance
manifolds and identify a manifold illumination invariant. We show that under the assumption
of a commonly used illumination model by which illumination effects on the appearance
are slowly spatially varying, tangent planes of the manifold retain their orientation under
the set of transformations caused by face illumination changes. To exploit the invariant, we
propose a novel method based on comparisons between linear subspaces corresponding to
linear patches, piece-wise approximating appearance manifolds. In particular, there are two
main areas of novelty: (i) we extend the concept of principal angles between linear subspaces
to manifolds with arbitrary nonlinearities; (ii) it is demonstrated how boosting can be used
for application-optimal principal angle fusion.


7.1 Manifold illumination invariants 

Let us start by formalizing our recognition framework. Let $\mathbf{x}$ be an image of a face and
$\mathbf{x} \in \mathbb{R}^D$, where $D$ is the number of pixels in the image and $\mathbb{R}^D$ the
corresponding image space. Then $\mathbf{f}(\mathbf{x}, \Theta)$ is an image of the same face after
the rotation with parameter $\Theta \in \mathbb{R}^3$ (yaw, pitch and roll). Function $\mathbf{f}$
is a generative function of the corresponding face motion manifold, obtained by varying
$\Theta$^1.

Rotation affected appearance changes. Now, consider the appearance change of a
face due to small rotation $\Delta\Theta$:

    $\Delta\mathbf{x} = \mathbf{f}(\mathbf{x}, \Delta\Theta) - \mathbf{x}$.    (7.1)

For small rotations, the geodesic neighbourhood of $\mathbf{x}$ is linear and using Taylor’s
theorem we get:

    $\mathbf{f}(\mathbf{x}, \Delta\Theta) - \mathbf{x} \approx \mathbf{f}(\mathbf{x}, \mathbf{0}) + \nabla\mathbf{f}|_{(\mathbf{x},\mathbf{0})} \cdot \Delta\Theta - \mathbf{x}$,    (7.2)

where $\nabla\mathbf{f}|_{(\mathbf{x},\mathbf{0})}$ is the Jacobian matrix evaluated at
$(\mathbf{x}, \mathbf{0})$. Noting that $\mathbf{f}(\mathbf{x}, \mathbf{0}) = \mathbf{x}$ and writing
$\mathbf{x}$ as a sum of its low and high frequency components $\mathbf{x} = \mathbf{x}_L + \mathbf{x}_H$:

    $\Delta\mathbf{x} \approx \nabla\mathbf{f}|_{(\mathbf{x},\mathbf{0})} \cdot \Delta\Theta = \nabla\mathbf{f}|_{(\mathbf{x}_L,\mathbf{0})} \cdot \Delta\Theta + \nabla\mathbf{f}|_{(\mathbf{x}_H,\mathbf{0})} \cdot \Delta\Theta$.    (7.3)

But $\mathbf{x}_L$ is by definition slowly spatially varying and therefore:

    $\|\nabla\mathbf{f}|_{(\mathbf{x}_L,\mathbf{0})} \cdot \Delta\Theta\| \ll \|\nabla\mathbf{f}|_{(\mathbf{x}_H,\mathbf{0})} \cdot \Delta\Theta\|$,    (7.4)

and

    $\Delta\mathbf{x} \approx \nabla\mathbf{f}|_{(\mathbf{x}_H,\mathbf{0})} \cdot \Delta\Theta$.    (7.5)

It can be seen that $\Delta\mathbf{x}$ is a function of the person-specific $\mathbf{x}_H$ but not
the illumination affected $\mathbf{x}_L$. Hence, the directions (in $\mathbb{R}^D$) of face
appearance changes due to small head rotations form a local manifold invariant with respect to
illumination variation, see Figure 7.1.


^1 As a slight digression, note that strictly speaking, $\mathbf{f}$ should be person-specific.
Due to self-occlusion of parts of the face, $\mathbf{f}$ cannot produce plausible images of
rotated faces simply from a single image $\mathbf{x}$. However, in our work, the range of head
rotations is sufficiently restricted that under the standard assumption of face symmetry
[Ara05c], $\mathbf{f}$ can be considered generic.


Figure 7.1: Under the assumption that illumination effects on the appearance of faces are
spatially slowly varying, appearance manifold tangent planes retain their orientation in the
image space with changes in lighting conditions.

The manifold illumination invariant we identified explicitly motivates the use of principal 
angles between tangent planes as a similarity measure between manifolds. We now address 
two questions that remain: 

• given principal angles between two tangent planes, what contribution should each 
principal angle have, and 

• given similarities between different tangent planes of two manifolds, how to obtain a 
similarity measure between the manifolds themselves. 

We now turn to the first of these problems. 
7.2 Boosted principal angles


In general, each principal angle $\theta_i$ carries some information for discrimination between
the corresponding two subspaces. We use this to build simple weak classifiers
$\mathcal{M}(\theta_i) = \mathrm{sign}[\cos(\theta_i) - C]$. In the proposed method, these are
combined using the now acclaimed AdaBoost algorithm [Fre95]. In summary, AdaBoost learns a
weighting $\{w_i\}$ of decisions cast by weak learners to form a classifier $\mathcal{M}(\Theta)$:

    $\mathcal{M}(\Theta) = \mathrm{sign}\left[ \sum_{i=1}^{N} w_i \mathcal{M}(\theta_i) \right]$    (7.6)


In an iterative update scheme classifier performance is optimized on training data which
consists of in-class and out-of-class features (i.e. principal angles). Let the training database
consist of sets $S_1, \ldots, S_K = \{S_i\}$, corresponding to $K$ classes. In the framework
described, the $K(K-1)/2$ out-of-class principal angles are computed between pairs of linear
subspaces corresponding to training data sets $\{S_i\}$, estimated using Principal Component
Analysis (PCA). On the other hand, the $K$ in-class principal angles are computed between a
pair of randomly drawn subsets for each $S_i$.

We use the learnt weights $\{w_i\}$ for computing the following similarity measure between
two linear subspaces:

    $f(\Theta) = \sum_{i=1}^{N} w_i \cos(\theta_i)$    (7.7)


A typical set of weights {wi} we obtained is shown graphically in Figure 7.3 (a). The 
plot shows an interesting result: the weight corresponding to the first principal angle is not 
the greatest. Rather it is the second principal angle that is most discriminating, followed by 
the third one. This shows that the most similar mode of variation across two subspaces can
indeed be due to an extrinsic factor. Figure 7.2 (b) shows the 3 most discriminating principal
vector pairs selected by our algorithm for data incorrectly classified by MSM - the most 
weighted principal vectors are now much less similar. The gain achieved with boosting is also 
apparent from Figure 7.3 (b). A significant improvement can be seen both for a small and 
a large number of principal angles. In the former case this is because our algorithm chooses 
not the first but the most discriminating set of angles. The latter case is practically more 
important - as more principal angles are added to MSM, its performance first improves, 
but after a certain point it starts worsening. This highly undesirable behaviour is caused 
by effectively equal weighting of base classifiers in MSM. In contrast, the performance of 
our algorithm never decreases as more information is added. As a consequence, no special 
provision for choosing the optimal number of principal angles is needed. 
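
The weighting scheme described above can be illustrated with a small discrete AdaBoost
sketch in Python. It is a simplified stand-in and not the training procedure used in the
thesis: the single shared threshold `threshold` (playing the role of $C$), the number of
rounds and the function names are assumptions, and the similarity is the weighted sum of
cosines of principal angles.

```python
import numpy as np


def boost_angle_weights(cosines, labels, threshold=0.5, rounds=10):
    """Learn weights w_i for weak classifiers sign[cos(theta_i) - C].

    cosines: (n_pairs, N) array, row r holds cos(theta_1..N) for one pair of subspaces.
    labels:  (n_pairs,) array of +1 (same class) or -1 (different class).
    """
    n, N = cosines.shape
    votes = np.where(cosines > threshold, 1, -1)      # decisions of the N weak classifiers
    sample_w = np.full(n, 1.0 / n)                    # AdaBoost weights over training pairs
    w = np.zeros(N)
    for _ in range(rounds):
        wrong = votes != labels[:, None]
        errors = (wrong * sample_w[:, None]).sum(axis=0)
        best = int(np.argmin(errors))                 # weak classifier with lowest error
        eps = float(np.clip(errors[best], 1e-10, 1 - 1e-10))
        if eps >= 0.5:                                # no weak classifier beats chance
            break
        beta = 0.5 * np.log((1 - eps) / eps)
        w[best] += beta
        sample_w *= np.exp(-beta * labels * votes[:, best])
        sample_w /= sample_w.sum()
    return w


def weighted_similarity(cos_row, w):
    """Weighted combination of principal angle cosines using learnt weights."""
    return float(np.dot(w, cos_row))
```

In this sketch a principal angle that never separates in-class from out-of-class pairs (for
instance a first angle that is always close to zero) receives no weight, which mirrors the
observation that the first principal angle need not be the most discriminating one.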
(a) (b) (c)


Figure 7.2: (a) The first 3 principal vectors between two linear subspaces which MSM
incorrectly classifies as corresponding to the same person. In spite of different identities,
the most similar modes of variation are very much alike and can be seen to correspond to
especially difficult illuminations, (b) Boosted Principal Angles (BPA), on the other hand,
chooses different principal vectors as the most discriminating - these modes of variation are
now less similar between the two sets, (c) Modelling of nonlinear manifolds corresponding to
the two image sets produces a further improvement. Shown are the most similar modes of
variation amongst all pairs of linear manifold patches. Local information is well captured and
even these principal vectors are now very dissimilar.
(a) Optimal angle weighting



(b) Performance improvement


Figure 7.3: (a) A typical set of weights corresponding to weak principal angle-based
classifiers, obtained using AdaBoost. This figure confirms our criticism of MSM-based methods
for (i) their simplistic fusion of information from different principal angles and (ii) the use
of only the first few angles, (b) The average performance of a simple MSM classifier and
our boosted variant.
At this point it is worthwhile mentioning the work of Maeda et al. [Mae04] in which the
third principal angle was found to be useful for discriminating between sets of images of a 
face and its photograph. Much like in MSM and CMSM, the use of a single principal angle 
was motivated only empirically - the framework described in this chapter can be used for a 
more principled feature selection in this setting as well. 


7.3 Nonlinear subspaces 


Our aim is to extend the described framework of boosted principal angles to being able to
effectively capture nonlinear data behaviour. We propose a method that combines global
manifold variations with more subtle, local ones.

Without loss of generality, let $S_1$ and $S_2$ be two sets of face appearance images and
$\Theta$ the set of principal angles between two linear subspaces. We derive a measure of
similarity $\rho$ between $S_1$ and $S_2$ by comparing the corresponding linear subspaces
$U_{1,2}$ and locally linear patches $L_{1,2}^{(i)}$ corresponding to piece-wise linear
approximations of manifolds of $S_1$ and $S_2$:

    $\rho(S_1, S_2) = \underbrace{(1 - \alpha)\, f_G[\Theta(U_1, U_2)]}_{\text{Global manifold similarity contribution}} + \underbrace{\alpha \max_{i,j} f_L[\Theta(L_1^{(i)}, L_2^{(j)})]}_{\text{Local manifold similarity contribution}}$    (7.8)

where $f_G$ and $f_L$ have the same functional form as $f$ in (7.7), but separately learnt base
classifier weights $\{w_i\}$. Put in words, the proximity between two manifolds is computed as
a weighted average of the similarity between global modes of data variation and the best
matching local behaviour. The two terms complement each other: the former provides (i)
robustness to noise, whereas the latter ensures (ii) graceful performance degradation with
missing data (e.g. unseen poses) and (iii) illumination invariance, see Figure 7.2 (c).
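
A compact Python sketch of the similarity in (7.8) is given below. It assumes that
orthonormal bases of the global subspaces and of the local patches have already been
estimated, and that the two weight vectors have been learnt separately; the function and
argument names are hypothetical and the weighted sum of cosines stands in for the learnt
functions $f_G$ and $f_L$.

```python
import numpy as np


def angle_cosines(B1, B2):
    """Cosines of principal angles between subspaces with orthonormal bases B1 and B2."""
    s = np.linalg.svd(B1.T @ B2, compute_uv=False)
    return np.clip(s, 0.0, 1.0)


def manifold_similarity(U1, U2, patches1, patches2, w_global, w_local, alpha=0.5):
    """(1 - alpha) * global similarity + alpha * best local patch-to-patch similarity."""
    f_global = float(w_global @ angle_cosines(U1, U2))
    f_local = max(
        float(w_local @ angle_cosines(L1, L2))
        for L1 in patches1
        for L2 in patches2
    )
    return (1 - alpha) * f_global + alpha * f_local
```

Here `U1` and `U2` are $D \times d$ matrices with orthonormal columns, and `patches1`,
`patches2` are lists of such matrices, one per local patch; the maximum runs over all pairs
of patches, matching the "best matching local behaviour" described above.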


Finding stable locally linear patches

In the proposed framework, stable locally linear manifold patches are found using Mixtures
of Probabilistic PCA (PPCA) [Tip99a]. The main difficulty in fitting of a PPCA mixture
is the requirement for the local principal subspace dimensionality to be set a priori. We
solve this problem by performing the fitting in two stages. In the first stage, a Gaussian
Mixture Model (GMM) constrained to diagonal covariance matrices is fitted. This
model is crude as it is insufficiently expressive to model local variable correlations, yet too
complex (in terms of free parameters) as it does not encapsulate the notion of intrinsic
manifold dimensionality and additive noise. However, what it is useful for is the estimation
of the intrinsic manifold dimensionality $d$, from the eigenspectra of its covariance matrices,
see Figure 7.4 (a). Once $d$ is estimated (typically $d \ll D$), the fitting is repeated using a
Mixture of PPCA.

Both the intermediate diagonal and the final PPCA mixtures are estimated using the
Expectation Maximization (EM) algorithm [Dud00] which is initialized by K-means clustering.
Automatic model order selection is performed using the well-known Minimum Description
Length (MDL) criterion [Dud00], see Figure 7.4 (b). Typically, the optimal (in the MDL
sense) number of components for face data sets used in Section 8.6 was 3.


(a) Average eigenspectrum



(b) PPCA mixture fitting


Figure 7.4: (a) Average eigenspectrum of diagonal covariance matrices in a typical
intermediate GMM fit. The approximate intrinsic manifold dimensionality can be seen to be
around 10. (b) Description length as a function of the number of Gaussian components in the
intermediate and final, PPCA-based GMM fitting on a typical data set. The latter results in
fewer components and a significantly lower MDL.


7.4 Empirical evaluation 


Methods in this chapter were evaluated on the CamFace data set, see Appendix C. We
compared the performance of our algorithm, without and with boosted feature selection
(respectively MPA and BoMPA), to that of:

• KL divergence algorithm (KLD) of Shakhnarovich et al. [Sha02a]^,

• Mutual Subspace Method (MSM) of Yamaguchi et al. [Yam98]^, 

• Kernel Principal Angles (KPA) of Wolf and Shashua [WolOS]^, and 

• Nearest Neighbour (NN) in the Hausdorff distance sense in (i) LDA [Bel97] and (ii) 
PCA [Tur91a] subspaces, estimated from data. 

In KLD 90% of data energy was explained by the principal subspace used. In MSM, the 
dimensionality of PCA subspaces was set to 9 [FukOS]. A sixth degree monomial expansion 
kernel was used for KPA [WolOS]. In BoMPA, we set the value of parameter a in (7.8) 
to 0.5. All algorithms were preceded with PCA estimated from the entire training dataset 
which, depending on the illumination setting used for training, resulted in dimensionality 
reduction to around 150 (while retaining 95% of data energy). 

In each experiment we performed training using sequences in a single illumination
setup and tested recognition with sequences in each different illumination setup in turn.

7.4.1 BoMPA implementation 

From a practical standpoint, there are two key points in the implementation of the proposed
method: (i) the computation of principal angles between linear subspaces and (ii) time
efficiency. These are now briefly summarized for the implementation used in the evaluation
reported in this chapter. We compute the cosines of principal angles using the method of
Björck and Golub [Bjö73], as singular values of the matrix $B_1^T B_2$, where $B_{1,2}$ are orthonormal
bases of the two linear subspaces. This method is numerically more stable than the eigenvalue
decomposition-based method used in [Yam98], with roughly the same computational
demands; see [Bjö73] for a thorough discussion of numerical issues pertaining to the
computation of principal angles. A computationally far more demanding stage of the proposed
method is the PPCA mixture estimation. In our implementation, a significant improvement
was achieved by dimensionality reduction using the incremental PCA algorithm of Hall et
al. [Hal00]. Finally, we note that the proposed model of pattern variation within a set
inherently places low demands on storage space.
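
As an illustration of the principal angle computation described above, the following is a minimal NumPy sketch rather than the implementation used in the evaluation. The function name and the toy subspaces are assumptions made for the example; the only ingredient taken from the text is that the cosines are the singular values of $B_1^T B_2$ for orthonormal bases $B_1$ and $B_2$.

```python
import numpy as np

def principal_angle_cosines(b1, b2):
    """Cosines of the principal angles between span(b1) and span(b2).

    b1 and b2 must have orthonormal columns. The cosines are the singular
    values of b1^T b2, returned in descending order.
    """
    s = np.linalg.svd(b1.T @ b2, compute_uv=False)
    return np.clip(s, 0.0, 1.0)  # guard against rounding slightly above 1

# Toy example (hypothetical data): two planes in R^3 that share the x-axis
# and are inclined at 60 degrees to each other.
theta = np.pi / 3
b1 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
b2 = np.array([[1.0, 0.0], [0.0, np.cos(theta)], [0.0, np.sin(theta)]])
cosines = principal_angle_cosines(b1, b2)
```

In this toy case the first cosine is 1 (the shared axis) and the second is cos 60° = 0.5.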

7.4.2 Results 

The performance of evaluated recognition algorithms is summarized in Table 7.1. Firstly,
note the relatively poor performance of the two nearest neighbour-type methods - the Hausdorff
NN in PCA and LDA subspaces. These can be considered as proxies for gauging the
difficulty of the recognition task, seeing that both can be expected to perform relatively
well if the imaging conditions do not greatly differ between training and test data sets.
Specifically, LDA-based methods have long been established in the single-shot face recognition
literature, e.g. see [Bel97, Zha98, Sad04, Wan04c, Kim05b]. The KL-divergence based
method achieved by far the worst recognition rate. Seeing that the illumination conditions
varied across data and that the face motion was largely unconstrained, the variation of
intra-class face patterns was significant, making this result unsurprising. This is consistent
with results reported in the literature [Ara05b].

^1 The algorithm was reimplemented through consultation with the authors.

^2 We used the original authors’ implementation.

Table 7.1: The mean recognition rate and its standard deviation across different training/test
illuminations (in %). The last row shows the average time in seconds for 100 set comparisons.

Method                 KLD    NN-LDA  NN-PCA  MSM    KPA    MPA    BoMPA
Recognition mean (%)   19.8   40.7    44.6    84.9   89.1   89.7   92.6
Recognition std (%)    9.7    6.6     7.9     6.8    10.1   5.5    4.3
Time (s)               7.8    11.8    11.8    0.8    45     7.0    7.0


The performance of the four principal angle-based methods confirms the premises of 
our work. Basic MSM performed well, but worst of the four. The inclusion of nonlinear 
manifold modelling, either by using the “kernel trick” or a mixture of linear subspaces, 
achieved an increase in the recognition rate of about 5%. While the difference in the average 
performance of MPA and the KPA methods is probably statistically insignificant, it is worth 
noting the greater robustness to specific imaging conditions of our MPA, as witnessed by a 
much lower standard deviation of the recognition rate. Further performance increase of 3% 
is seen with the use of boosted angles, the proposed BoMPA algorithm correctly recognizing 
92.6% of the individuals with the lowest standard deviation of all methods compared. An 
illustration of the improvement provided by each novel step in the proposed algorithm is 
shown in Figure 7.5. Finally, its computational superiority to the best performing method
in the literature, Wolf and Shashua’s KPA, is clear from a 7-fold difference in the average
recognition time.

[Figure 7.5 panels: (a) Per-case rank-N performance; (b) Average rank-N performance]

Figure 7.5: Shown is the improvement in rank-N recognition accuracy of the basic MSM, 
MPA and BoMPA algorithms for (a) each training/test combination and (b) on average. A 
consistent and significant improvement is seen with nonlinear manifold modelling, which is 
further increased using boosted principal angles. 

7.5 Summary and conclusions

In this chapter we showed how appearance manifolds can be used to integrate information 
across a face motion video sequence to achieve illumination invariance. This was done by 
combining (i) an illumination model, and (ii) observed appearance changes, to derive a 
manifold illumination invariant. A novel method, the Boosted Manifold Principal Angles 
(BoMPA), was proposed to exploit the invariant. We used a boosting framework by which 
focus is put on the most discriminative regions of invariant tangent planes and introduced 
a method for fusing their similarities to obtain the overall manifold similarity. The method 
was shown to be successful in recognition across large changes in illumination. 


Related publications 

The following publications resulted from the work presented in this chapter: 

• T-K. Kim, O. Arandjelovic and R. Cipolla. Learning over Sets using Boosted Manifold
Principal Angles (BoMPA). In Proc. IAPR British Machine Vision Conference
(BMVC), 2:pages 779-788, September 2005. [Kim05a]

• O. Arandjelovic and R. Cipolla. Face set classification using maximally probable mutual
modes. In Proc. IEEE International Conference on Pattern Recognition (ICPR),
pages 511-514, August 2006. [Ara06c]

• T-K. Kim, O. Arandjelovic and R. Cipolla. Boosted manifold principal angles for 
image set-based recognition. Pattern Recognition, 40(9):2475-2484, September 2007. 
[Kim07] 

In the preceding chapters, illumination invariance was achieved largely by employing
a priori domain knowledge, such as the smoothness of faces and their largely Lambertian
reflectance properties. Subtle, yet important effects of the underlying complex photometric
process were not captured, with cast shadows and specularities both causing incorrect recognition
decisions. In this chapter we take a further step towards the goal of combining models
stemming from our understanding of image formation and learning from available data.

In particular there are two major areas of novelty: (i) illumination generalization is
achieved using a two-stage method, combining coarse region-based gamma intensity correction
with normalization based on a pose-specific illumination subspace, learnt offline; (ii)
pose robustness is achieved by decomposing each appearance manifold into semantic Gaussian
pose clusters, comparing the corresponding clusters and fusing the results using an RBF
network. On the ToshFace data set, the proposed algorithm consistently demonstrated a
very high recognition rate (95% on average), significantly outperforming state-of-the-art
methods from the literature.


8.1 Overview 

A video sequence of a moving face carries information about its 3D shape and texture. In
terms of recognition, this information can be used either explicitly, by recovering parameters
of a generative model of the face (e.g. as in [Bla03]), or implicitly by modelling face
appearance and trying to achieve invariance to extrinsic causes of its variation (e.g. as in
[Ara05c]). In this chapter we employ the latter approach, as more suited for low-resolution
input data (see Section 8.6 for typical data quality) [Eve04].

In the proposed method, manifolds [Ara05b, Bic94] of face appearance are modelled 
using at most three Gaussian pose clusters describing small face motion around different 
head poses. Given two such manifolds, first (i) the pose clusters are determined, then (ii) 
those corresponding in pose are compared and finally, (iii) the results of pairwise cluster 
comparisons are combined to give a unified measure of similarity of the manifolds themselves. 
Each of the steps, aimed at achieving robustness to a specific set of nuisance parameters, is 
described in detail next. 


8.2 Face registration 

Using the standard appearance representation of a face as a raster-ordered pixel array, it 
can be observed that the corresponding variations due to head motion, i.e. pose changes, are 
highly nonlinear, see Figure 8.1 (a,b). A part of the difficulty of recognition from appearance 

[Figure 8.1 panels: (b) Face Motion Manifold; (c) Clusters]

Figure 8.1: A typical input video sequence of random head motion performed by the user
(a) and the corresponding face appearance manifold (b). Shown is the projection of affine-registered
data (see Section 8.2) to the first three linear principal components. Note that
while highly nonlinear, the manifold is continuous and smooth. Different poses are marked
in different styles (red stars, blue dots and green squares). Examples of faces from the three
clusters can be seen in (c) (also affine-registered and cropped).


manifolds is then contained in the problem of what is an appropriate way of representing 
them, in a way suitable for the analysis of the effects of varying illumination or pose. 

In the proposed method, face appearance manifolds are represented in a piecewise linear
manner, by a set of semantic Gaussian pose clusters, see Figure 8.1 (b,c). Seeing that each
cluster describes a locally linear mode of variation, this approach to modelling manifolds
becomes increasingly difficult as their intrinsic dimensionality is increased. Therefore, it
is advantageous to normalize the raw, input frames as much as possible so as to minimize
this dimensionality. In this first step of our method, this is done by registering faces, i.e.
by warping them to have a set of salient facial features aligned. For related approaches see
[Ara05c, Ber04].

We compute warps that align each face with a canonical frame using four point correspondences:
the locations of pupils (2) and nostrils (2). These are detected using a two-stage

[Figure 8.2 panels: (a) Original; (b) Detections; (c) Cropped; (d) Registered]


Figure 8.2: (a) Original input frame (resolution 320 x 240 pixels), (b) superimposed detections
of the two pupils and nostrils (as white circles), (c) cropped face regions with background
clutter removed, and (d) the final affine registered and cropped image of the face
(resolution 30 x 30 pixels).


feature detector of Fukui and Yamaguchi [Fuk98]^. Briefly, in the first stage, shape matching
is used to rapidly remove a large number of locations in the input image that do not contain
features of interest. Out of the remaining, ‘promising’ features, true locations are chosen
using the appearance-based, distance from feature space criterion. We found that the described
method reliably detected pupils and nostrils across a wide variation in illumination
conditions and pose.

From the four point correspondences between the locations of the facial features and
their canonical locations (we chose canonical locations to be the mean values of true feature
locations) we compute optimal affine warps on a per-frame basis. Since four correspondences
over-determine the affine transformation parameters (8 equations with 6 unknown
parameters), we estimate them in the minimum $L_2$ error sense. Finally, the resulting images
are cropped, so as to remove background clutter, and resized to the uniform scale of 30 x 30
pixels. An example of a face registered and cropped in the described manner is shown in
Figure 8.2 (also see Figure 8.1 (c)).
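
The least-squares estimation of the affine warp from four correspondences can be sketched as follows. This is only an illustrative NumPy sketch: the function names, the canonical feature locations and the simulated detections are hypothetical, and image resampling and cropping are omitted.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine map taking src points to dst points.

    src, dst: (n, 2) arrays of (x, y) locations, n >= 3 and not collinear.
    Returns a 2 x 3 matrix A such that dst ~= A @ [x, y, 1]^T.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    homog = np.hstack([src, np.ones((len(src), 1))])    # (n, 3)
    # Solve homog @ A^T = dst in the minimum L2 error sense
    # (for n = 4: 8 equations, 6 unknowns).
    a_t, *_ = np.linalg.lstsq(homog, dst, rcond=None)
    return a_t.T

def apply_affine(a, pts):
    """Apply a 2 x 3 affine matrix to an (n, 2) array of points."""
    pts = np.asarray(pts, dtype=float)
    return pts @ a[:, :2].T + a[:, 2]

# Hypothetical canonical locations of the two pupils and two nostrils
# in a 30 x 30 frame.
canonical = np.array([[9.0, 10.0], [21.0, 10.0], [12.0, 20.0], [18.0, 20.0]])

# Simulated detections in a 320 x 240 input frame (an affine image of the
# canonical layout, for illustration only).
true_map = np.array([[2.0, 0.1, 40.0], [-0.1, 2.0, 60.0]])
detected = apply_affine(true_map, canonical)

warp = fit_affine(detected, canonical)       # detected -> canonical
registered = apply_affine(warp, detected)
```

Because the simulated detections are an exact affine image of the canonical layout, the fitted warp maps them back onto the canonical locations; with real, noisy detections the fit only minimizes the squared residual.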


8.3 Pose-invariant recognition 

Achieving invariance to varying pose is one of the most challenging aspects of face recognition 
and yet a prerequisite condition for most practical applications. This problem is complicated 
further by variations in illumination conditions, which inevitably occur due to movement of 
the user relative to the light sources. 

We propose to handle changing pose in two, complementary stages: (i) in the first stage 

^We thank the authors for kindly providing us with the original code of their algorithm. 
an appearance manifold is decomposed to Gaussian pose clusters, effectively reducing the
problem to recognition under a small variation in pose parameters; (ii) in the second stage, 
fixed-pose recognition results are fused using a neural network, trained offline. The former 
stage is addressed next, while the latter is the topic of Section 8.5. 

8.3.1 Defining pose clusters 

Inspection of manifolds of registered faces in random motion around the fronto-parallel face
shows that they are dominated by the first nonlinear principal component. This principal
component corresponds to lateral head rotation, i.e. changes in the face yaw, see
Figure 8.1 (a,b). The reason for this lies in the greater smoothness of the face surface in the
vertical than in the horizontal direction - pitch changes (“nodding”) are largely compensated
for by using the affine registration described in Section 8.2. This is not the case with
significant changes, when self-occlusion occurs.

Therefore, the centres of Gaussian clusters used to linearize an appearance manifold 
correspond to different yaw angle values. In this work we describe the manifolds using three 
Gaussian clusters, corresponding to the frontal face orientation, face left and face right, see 
Figure 8.1 (a,b). 

8.3.2 Finding pose clusters 

As the extent of lateral rotation, as well as the number of frames corresponding to each
cluster, can vary between video sequences, a generic clustering algorithm, such as the
k-means algorithm, is unsuitable for finding the three Gaussians.

With the prior knowledge of the semantics of clusters, we decide on a single face image 
membership on a frame-by-frame basis. We show that this can be done in a very simple 
and rapid manner from already detected locations of the four characteristic facial features: 
the pupils and nostrils, see Section 8.2. 

The proposed method relies on motion parallax based on inherent properties of the shape 
of faces. Consider the anatomy of a human head shown in profile view in Figure 8.3 (a). It 
can be seen that the pupils are further away than the nostrils from the vertical axis defined 
by the neck. Hence, assuming no head roll takes place, as the head rotates laterally, nostrils 
travel a longer projected path in the image. Using this observation, we define the quantity 
$\eta$ as follows:

$$\eta = x_{ce} - x_{cn}, \qquad (8.1)$$

Figure 8.3: (a) A schematic illustration of the motion parallax used for coarse pose clustering
of input faces (the diagram is based on a figure taken from [Gra18]). (b) The distributions
of the scale-normalized parallax measure $\hat{\eta}$ defined in (8.3) for the three pose clusters on the
offline training data set. Good separation is demonstrated.


where $x_{ce}$ and $x_{cn}$ are the mid-points between, respectively, the eyes and the nostrils:

$$x_{ce} = \frac{x_{e1} + x_{e2}}{2}, \qquad x_{cn} = \frac{x_{n1} + x_{n2}}{2}. \qquad (8.2)$$


It can now be understood that $\eta$ approximates the discrepancy between distances travelled
by the mid-points between the eyes and nostrils, measured from the frontal face orientation.
Finally, we normalize $\eta$ by dividing it by the distance between the eyes, to obtain $\hat{\eta}$, the
scale-invariant parallax measure:

$$\hat{\eta} = \frac{\eta}{\| x_{e1} - x_{e2} \|} = \frac{x_{ce} - x_{cn}}{\| x_{e1} - x_{e2} \|}. \qquad (8.3)$$


Learning the parallax model. In our method, discrete poses used for linearizing appearance
manifolds are automatically learnt from a small training corpus of video sequences
of faces in random motion. To learn the model, we took 20 sequences of 100 frames each,
acquired at 10fps, and computed the value of $\hat{\eta}$ for each registered face. We then applied
the k-means clustering algorithm [Dud00] on the obtained set of parallax measure values
and fitted a 1D Gaussian to each, see Figure 8.3 (b).

To apply the learnt model, a frame in our method is classified to the maximal likelihood 
pose. In other words, when a novel face is to be classified to one of the three pose clusters 
(i.e. head poses), we evaluate pose likelihood given each of the learnt distributions and 
classify it to the one giving the highest probability of the observation. Figure 8.4 shows the 
proportions of faces belonging to each pose cluster. 
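
The parallax measure and the maximum likelihood pose assignment can be sketched in a few lines of Python. This is a hedged illustration only: it assumes that the quantities in (8.1)-(8.2) are horizontal image coordinates, and the per-pose Gaussian parameters and the sign convention for 'left' and 'right' are hypothetical placeholders, not the values learnt from the training corpus.

```python
import math

def parallax_measure(pupil_l, pupil_r, nostril_l, nostril_r):
    """Scale-normalized parallax: horizontal offset between the eye and
    nostril mid-points, divided by the distance between the pupils."""
    eye_mid_x = 0.5 * (pupil_l[0] + pupil_r[0])
    nostril_mid_x = 0.5 * (nostril_l[0] + nostril_r[0])
    return (eye_mid_x - nostril_mid_x) / math.dist(pupil_l, pupil_r)

def gaussian_log_likelihood(value, mean, std):
    """Log-density of a 1D Gaussian at `value`."""
    z = (value - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))

def classify_pose(eta_hat, pose_models):
    """Return the pose whose 1D Gaussian gives the highest likelihood."""
    return max(
        pose_models,
        key=lambda pose: gaussian_log_likelihood(eta_hat, *pose_models[pose]),
    )

# Hypothetical (mean, std) of the parallax measure for each pose cluster.
POSE_MODELS = {"left": (-0.15, 0.05), "frontal": (0.0, 0.04), "right": (0.15, 0.05)}

eta_frontal = parallax_measure((100, 80), (140, 80), (113, 110), (123, 110))
eta_turned = parallax_measure((100, 80), (140, 80), (108, 110), (118, 110))
pose_frontal = classify_pose(eta_frontal, POSE_MODELS)
pose_turned = classify_pose(eta_turned, POSE_MODELS)
```

Working in log-likelihoods avoids underflow for frames whose parallax value lies far from every cluster mean.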

Figure 8.4: Histograms of the number of correctly registered faces using four point correspondences
between detected facial features (pupils and nostrils) for each of the three discrete
poses and in total for each sequence.


8.4 Illumination-invariant recognition 

Illumination variation of face patterns is extremely complex due to varying surface reflectance
properties, face shape, and type and distance of lighting sources. Hence, in such a
general setup, this is a difficult problem to approach in a purely discriminative fashion.

Our method for compensating for illumination changes is based on the observation that
on the coarse level most of the variation can be described by the dominant light direction,
e.g. ‘strong light from the left’. Such variations are much easier to address. We will also
demonstrate that once the data is normalized at this coarse level, the learning of
residual illumination changes is significantly simplified as well. This motivates the two-stage,
per-pose illumination normalization employed in the proposed method:

1. Coarse level: Region-based gamma intensity correction (GIC), followed by 

2. Fine level: Illumination subspace normalization. 

The algorithm is summarized in Figure 8.5 while its details are explained in the sections 
that follow. 

8.4.1 Gamma intensity correction 

Gamma Intensity Correction (GIC) is a well-known image intensity histogram transformation
technique that is used to compensate for global brightness changes [Gon92]. It
transforms pixel values (normalized to lie in the range [0.0,1.0]) by exponentiation so as
to best match a canonically illuminated image. This form of the operator is motivated by
the non-linear exposure-image intensity response of photographic film, which it approximates
well over a wide range of exposure. Formally, given an image $I$ and a canonically illuminated

Input:  pose clusters $C_1 = \{x_i^{(1)}\}$, $C_2 = \{x_i^{(2)}\}$,
        face regions mask $r$,
        mean face (for pose) $m$,
        pose illumination subspace basis matrix $B_I$.

Output: pose cluster $C_1$ normalized to $C_2$, $\hat{C}_1$.

1: Per-frame region-based GIC, sequence 1
   $\forall i.\; x_i^{(1)} = \mathrm{region\_GIC}(r, m, x_i^{(1)})$

2: Per-frame region-based GIC, sequence 2
   $\forall i.\; x_i^{(2)} = \mathrm{region\_GIC}(r, m, x_i^{(2)})$

3: Per-frame illumination subspace compensation
   $\forall i.\; \hat{x}_i^{(1)} = B_I a_i^* + x_i^{(1)}$,
   where $a_i^* = \arg\min_{a_i} D_M\left(B_I a_i + x_i^{(1)} - \langle C_2 \rangle;\; C_2\right)$

4: The result is the normalized cluster
   $\hat{C}_1 = \{\hat{x}_i^{(1)}\}$

Figure 8.5: Proposed illumination normalization algorithm. Coarse appearance changes 
due to illumination variation are normalized using region-based gamma intensity correction, 
while the residual variation is modelled using a linear, pose-specific illumination subspace, 
learnt offline. Local manifold shape is employed as a constraint in the second, ‘fine’ stage 
of normalization in the form of Mahalanobis distance for the computation of the optimal 
additive illumination subspace component. 


image $I_c$, the gamma intensity corrected image $I^*$ is defined as follows:

$$I^*(x, y) = I(x, y)^{\gamma^*}, \qquad (8.4)$$

where $\gamma^*$ is the optimal gamma value and is computed using

$$\gamma^* = \arg\min_{\gamma} \| I^{\gamma} - I_c \| = \qquad (8.5)$$

$$\arg\min_{\gamma} \sum_{x,y} \left[ I(x, y)^{\gamma} - I_c(x, y) \right]^2. \qquad (8.6)$$

This is a nonlinear optimization problem in 1D. In our implementation of the proposed
method it is solved using the Golden Section search with parabolic interpolation, see [Pre92]
for details.
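
The 1D search for the optimal gamma in (8.5)-(8.6) can be illustrated with a short, self-contained Python sketch. Note the assumptions: it uses a plain golden section search without the parabolic interpolation step mentioned above, images are flat lists of pixel values in [0, 1], and the search interval and the toy pixel values are hypothetical.

```python
import math

def gic_cost(gamma, image, canonical):
    """Sum of squared differences between image^gamma and the canonical image."""
    return sum((p ** gamma - c) ** 2 for p, c in zip(image, canonical))

def golden_section_minimize(f, lo, hi, tol=1e-6):
    """Minimize a unimodal function f on [lo, hi] by golden section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                      # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                            # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)

def gamma_intensity_correction(image, canonical, lo=0.1, hi=5.0):
    """Return (optimal gamma, corrected image) for flat pixel lists in [0, 1]."""
    gamma = golden_section_minimize(lambda g: gic_cost(g, image, canonical), lo, hi)
    return gamma, [p ** gamma for p in image]

# Toy example: the input is the canonical image darkened so that
# raising it to the power 0.6 recovers the canonical image exactly.
canonical = [0.2, 0.4, 0.6, 0.8]
image = [c ** (1.0 / 0.6) for c in canonical]
gamma_star, corrected = gamma_intensity_correction(image, canonical)
```

On this toy input the cost is unimodal in gamma, so the search converges to a value close to 0.6 and the corrected pixels closely match the canonical ones.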

Region-based gamma intensity correction. Gamma intensity correction can be used
across a wide range of types of input to correct for global brightness changes. However, in
the case of objects with a highly variable surface normal, such as faces, it is unable to correct
for the effects of side lighting. This is recognized as one of the most difficult problems in
face recognition [Adi97].

Region-based GIC proposes to overcome this problem by dividing the image (and hence
implicitly the imaged object/face as well) into regions corresponding to surfaces with
near-constant surface normal. Regular gamma intensity correction is then applied to each region
separately, see Figure 8.6.

An undesirable result of this method is that it tends to produce artificial intensity discontinuities
at region boundaries [Sha03]. This occurs due to discontinuities in the computed
gamma values between neighbouring regions. We propose to first Gaussian-blur the obtained
gamma value map image $\Gamma^*$ (with $G_{\sigma}$ denoting a Gaussian kernel):

$$\Gamma_s^* = \Gamma^* * G_{\sigma}, \qquad (8.7)$$

before applying it to an input image to give the final, region-based gamma corrected output:

$$\forall x, y.\; I_s(x, y) = I(x, y)^{\Gamma_s^*(x, y)}. \qquad (8.8)$$


This method almost entirely remedies the problem with boundary artefacts, as illustrated in 
Figure 8.6. Note that because smoothing is performed on the gamma map, not the processed 
image, the artefacts are removed without any loss of discriminative, high frequency detail, 
see Figure 8.7. 

8.4.2 Pose-specific illumination subspace normalization 

After region-based GIC is applied to all images, for each of the pose clusters, it is assumed
that the lighting variation can be modelled using a linear, pose illumination subspace. Given 
a reference and a novel cluster corresponding to the same pose, each frame of the novel cluster 
is normalized for the illumination change. This is done by adding a vector from the pose 
illumination subspace to the frame so that its distance from the reference cluster’s centre is 
minimal. 

Learning the model. We define a pose-specific illumination subspace to be a linear 
manifold that explains intra-personal appearance variations due to illumination changes 

[Figure 8.6 panels: (a) Avg ‘left’ face; (b) Original; (c) GIC output; (d) Smooth output; (e) Original GIC map; (f) Smooth GIC map]


Figure 8.6: Canonical illumination image and the regions used in region-based GIC (a),
original unprocessed face image (b), region-based GIC corrected image without smoothing (c),
region-based GIC corrected image with smoothing (d), gamma value map of the original
method (e), and smoothed gamma value map of the proposed method (f). Notice artefacts at
region boundaries in the gamma corrected image (c). The output of the proposed smooth
region-based GIC in (d) does not have the same problem. Finally, note that the coarse effects
of the strong side lighting in (b) have been largely removed.


across a narrow range of poses. In other words, this is the principal subspace of the
within-class scatter.


Formalizing the definition above, given that $x_{i,j}^k$ is the $k$-th of $N_f(i,j)$ frames of person
$i$ under the illumination $j$ (out of $N_l(i)$), the within-class scatter matrix is:

$$S_s = \sum_{i=1}^{N_p} \sum_{j=1}^{N_l(i)} \sum_{k=1}^{N_f(i,j)} (x_{i,j}^k - \bar{x}_i)(x_{i,j}^k - \bar{x}_i)^T, \qquad (8.9)$$

where $N_p$ is the total number of training individuals and $\bar{x}_i$ is the mean face of the person


Figure 8.7: (a) Seamless output of the proposed smooth region-based GIC. Boundary artefacts
are removed without blurring of the image. Contrast this with the output of the original
region-based GIC, after Gaussian smoothing (b). Image quality is significantly reduced, with
boundary edges still clearly visible.


in the range of considered poses:

$$\bar{x}_i = \frac{\sum_{j=1}^{N_l(i)} \sum_{k=1}^{N_f(i,j)} x_{i,j}^k}{\sum_{j=1}^{N_l(i)} N_f(i,j)}. \qquad (8.10)$$


The pose-specific illumination subspace basis $B_I$ is then computed by eigendecomposition
of $S_s$ as the principal subspace explaining 90% of data energy variation.

For offline learning of illumination subspaces we used 10s video sequences of 20 individuals,
each in 5 illumination conditions, acquired at 10fps. The first few basis vectors learnt
in the described manner are shown as images in Figure 8.8.
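
The learning of a pose-specific illumination subspace from the within-class scatter can be sketched as below. This is an illustrative NumPy sketch under stated assumptions, not the original implementation: the function name is hypothetical, each person's frames (all illuminations pooled, one pose) are passed as rows of an array, and the synthetic data stands in for real face images.

```python
import numpy as np

def illumination_subspace(frames_by_person, energy=0.90):
    """Principal subspace of the within-class scatter.

    frames_by_person: list of (n_i, D) arrays, one per person, holding all
    frames of that person in a single pose, pooled over illuminations.
    Returns (basis, eigenvalues): a D x d matrix with orthonormal columns
    and the d leading eigenvalues, with d the smallest number of directions
    explaining at least `energy` of the total scatter.
    """
    dim = frames_by_person[0].shape[1]
    scatter = np.zeros((dim, dim))
    for frames in frames_by_person:
        centred = frames - frames.mean(axis=0)   # subtract the person's mean face
        scatter += centred.T @ centred
    eigvals, eigvecs = np.linalg.eigh(scatter)   # ascending eigenvalues
    eigvals = np.clip(eigvals[::-1], 0.0, None)  # descending, non-negative
    eigvecs = eigvecs[:, ::-1]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    d = int(np.searchsorted(cumulative, energy)) + 1
    return eigvecs[:, :d], eigvals[:d]

# Synthetic stand-in data: 3 "people" in R^5 whose appearance varies almost
# entirely along one shared direction, plus a small amount of noise.
rng = np.random.default_rng(0)
direction = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
strength = np.linspace(-1.0, 1.0, 20)[:, None]
people = []
for _ in range(3):
    mean_face = rng.normal(size=5)
    noise = 0.01 * rng.normal(size=(20, 5))
    people.append(mean_face + 5.0 * strength * direction + noise)

basis, eigvals = illumination_subspace(people)
```

Subtracting each person's own mean before accumulating the scatter removes identity differences between people, so the leading eigenvectors capture the shared, within-person variation.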


Employing the model. Let $C_1 = \{x_1^{(1)}, \ldots, x_{N_1}^{(1)}\}$ and $C_2 = \{x_1^{(2)}, \ldots, x_{N_2}^{(2)}\}$ be two corresponding
pose clusters of different appearance manifolds, previously preprocessed using
the region-based gamma correction algorithm described in Section 8.4.1. Cluster $C_1$ is then
illumination-normalized with respect to $C_2$ (we will therefore refer to $C_2$ as the reference
cluster), under the null assumption that the identities of the two people they represent are
the same. The normalization is performed on a frame-by-frame basis, by adding a vector
$B_I a_i^*$ from the estimated pose-specific illumination subspace:

$$\forall i.\; \hat{x}_i^{(1)} = B_I a_i^* + x_i^{(1)}, \qquad (8.11)$$

where we define $a_i^*$ as:

$$a_i^* = \arg\min_{a_i} \| B_I a_i + x_i^{(1)} - \langle C_2 \rangle \|, \qquad (8.12)$$

[Figure 8.8 panels: (b) Left; (c) Eigenvalues]


Figure 8.8: Shown as images are the first 5 bases of pose-specific illumination subspaces for 
the (a) frontal and (b) left head orientations. The distribution of energy for pose-specific 
illumination variation across principal directions is shown in (c). 


and $\| \cdot \|$ is a vector norm and $\langle C_2 \rangle$ the mean face of cluster $C_2$. We then define cluster $C_1$
normalized to $C_2$ to be $\hat{C}_1 = \{\hat{x}_i^{(1)}\}$. This form is directly motivated by the definition of a
pose-specific subspace.

To understand the next step, which is the choice of the vector norm in (8.12), it is
important to notice in the definition of the pose-specific illumination subspace, that the basis
$B_I$ explains not only appearance variations caused by illumination: reflectance properties
of faces used in training (e.g. their albedos), as well as subjects’ pose changes also affect it.
This is especially the case as we do not make the common assumption that surfaces of faces
are Lambertian, or that light sources are point lights at infinity.

The significance of this observation is that the subspace of a dimensionality sufficiently 
high to explain the modelled phenomenon (illumination changes) will, undesirably, also be 
able to explain ‘distracting’ phenomena, such as differing identity. The problem is therefore 
that of constraining the region of interest of the subspace to that which is most likely to 
be due to illumination changes for a particular individual. For this purpose we propose 
to exploit the local structure of appearance manifolds, which are smooth. We do this 
by employing the Mahalanobis distance (using the probability density corresponding to the 
reference cluster) when computing the illumination subspace correction for each novel frame 
using (8.12). Formally: 

$$a_i^* = \arg\min_{a_i} \left(B_I a_i + x_i^{(1)} - \langle C_2 \rangle\right)^T B_2 \Lambda_2^{-1} B_2^T \left(B_I a_i + x_i^{(1)} - \langle C_2 \rangle\right), \qquad (8.13)$$

where $B_2$ and $\Lambda_2$ are, respectively, the reference cluster’s orthonormal basis and the diagonalized
covariance matrix. We found that the use of Mahalanobis distance, as opposed to the
usual Euclidean distance, achieved better explanation of novel images when the person’s
identity was the same, and worse when it was different, achieving better inter-to-intra class
separation.

This quadratic minimization problem is solved by differentiation and the minimum is
achieved for:

$$a_i^* = \left(B_I^T B_2 \Lambda_2^{-1} B_2^T B_I\right)^{-1} B_I^T B_2 \Lambda_2^{-1} B_2^T \left(\langle C_2 \rangle - x_i^{(1)}\right). \qquad (8.14)$$

Examples of registered and cropped face images before and after illumination normalization
can be seen in Figure 8.9 (a).
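
The closed-form minimizer of the Mahalanobis criterion (8.13) can be checked numerically with a small NumPy sketch. All names and the random toy data are assumptions made for illustration; the sketch solves the normal equations obtained by setting the gradient of the quadratic form to zero, and works in the full $D$-dimensional space rather than using the dimensionality reduction discussed next.

```python
import numpy as np

def illumination_coefficients(b_i, b_2, lam_2, frame, ref_mean):
    """Coefficients a* minimizing the Mahalanobis criterion.

    b_i:      D x d illumination subspace basis.
    b_2:      D x D orthonormal basis of the reference cluster.
    lam_2:    length-D vector of the reference cluster's eigenvalues.
    frame:    length-D novel frame.
    ref_mean: length-D mean face of the reference cluster.
    """
    inv_cov = (b_2 / lam_2) @ b_2.T              # B_2 Lambda_2^{-1} B_2^T
    lhs = b_i.T @ inv_cov @ b_i
    rhs = b_i.T @ inv_cov @ (ref_mean - frame)
    return np.linalg.solve(lhs, rhs)

# Toy data (hypothetical): a frame that differs from the reference mean
# only by a vector lying in the illumination subspace.
rng = np.random.default_rng(1)
D, d = 6, 2
b_i, _ = np.linalg.qr(rng.normal(size=(D, d)))
b_2, _ = np.linalg.qr(rng.normal(size=(D, D)))
lam_2 = np.linspace(2.0, 0.5, D)
ref_mean = rng.normal(size=D)
a_true = np.array([0.7, -0.3])
frame = ref_mean - b_i @ a_true

a_star = illumination_coefficients(b_i, b_2, lam_2, frame, ref_mean)
normalized = b_i @ a_star + frame                # as in (8.11)
```

Since the toy frame differs from the reference mean only within the illumination subspace, the recovered coefficients equal `a_true` and the normalized frame coincides with the reference mean.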

Practical considerations. The computation of the optimal value a* using (8.14) involves 
inversion and Principal Component Analysis (PCA) on matrices of size D x D, where D 
is the number of pixels in a face image (in our case equal to 900, see Section 8.2). Both 
of these operations put high demands on computer resources. To reduce the computational 
overhead, we exploit the assumption that data modelled is of much lower dimensionality 
than D. 

Formalizing the model of low-dimensional face manifolds, we assume that an image $y$
of subject $i$’s face is drawn from the probability density $p_F^{(i)}(y)$ within the face space, and
embedded in the image space by means of a mapping function $f^{(i)}: \mathbb{R}^d \to \mathbb{R}^D$. The
resulting point in the $D$-dimensional space is further perturbed by noise drawn from a noise
distribution $p_n$ (note that the noise operates in the image space) to form the observed image
$x$. Therefore the distribution of the observed face images of the subject $i$ is given by the
integral:

$$p^{(i)}(x) = \int p_F^{(i)}(y)\, p_n\left(f^{(i)}(y) - x\right) dy. \qquad (8.15)$$

This model is then used in two stages: 

1. Pose-specific PCA dimensionality reduction,

2. Exact computation of the linear principal and rapid estimation of the complementary 
subspace of a pose cluster. 

Specifically, we first perform a linear projection of all images in a specific pose cluster 
to a pose-specific face subspace that explains 95% of data variation in a specific pose. This 
achieves data dimensionality reduction from 900 to 250. 

Referring back to (8.15), to additionally speed up the process, we estimate the intrinsic
dimensionality of face manifolds (defined as explaining 95% of within-cluster data variability)
and assume that all other variation is due to isotropic Gaussian noise $p_n$. Hence, we
can write the basis of the PCA subspace corresponding to the reference cluster as consisting
of principal and complementary subspaces [Tip99b], represented by orthonormal basis
matrices, respectively $V_P$ and $V_C$:

$$B_2 = [V_P \;\; V_C], \qquad (8.16)$$

where $V_P \in \mathbb{R}^{250 \times 6}$ and $V_C \in \mathbb{R}^{250 \times 244}$. The principal subspace and the associated
eigenvectors $v_1, \ldots, v_6$ are rapidly computed, e.g. using [Bag96]. The isotropic noise covariance
and the complementary subspace basis are then estimated in the following manner:

$$\lambda_n = \omega \sum_{i=1}^{6} \lambda_i, \qquad V_C = \mathrm{null}(V_P), \qquad (8.17)$$

where the nullspace of the principal subspace is computed using the QR-decomposition
[Pre92] and the value of $\omega$ estimated from a small training corpus; we obtained $\omega \approx 2.2\mathrm{e}{-4}$.
The diagonalized covariance matrix is then simply:

$$\Lambda_2 = \mathrm{diag}(\lambda_1, \ldots, \lambda_6, \underbrace{\lambda_n, \ldots, \lambda_n}_{244}). \qquad (8.18)$$

[Figure 8.9 panels: (a) Illumination normalization; (b) Original clusters; (c) Normalized clusters]


Figure 8.9: In (a) are shown, top to bottom, the original registered and cropped
face images from an input video sequence, the same faces after the proposed illumination
normalization and a sample from the reference video sequence. The effects of strong side
lighting have been largely removed, while at the same time a high level of detail is retained.
The corresponding data from the two sequences, before and after illumination compensation,
are shown under (b) and (c). Shown are their projections to the first two principal components.
Notice that initially the clusters were completely non-overlapping. Illumination
normalization has adjusted the location of the centre of the blue cluster, but has also affected its
spread. After normalization, while overlapping, the two sets of patterns are still distributed
quite differently.


8.5 Comparing normalized pose clusters 

Having illumination normalized one face cluster to match another, we want to compute a 
similarity measure between them, a distance, expressing our degree of belief that they belong 
to the same person. 

At this point it is instructive to examine the effects of the described method for illumination
normalization on the face patterns. Two clusters before, and after one has been
normalized, are shown in Figure 8.9 (b,c). An interesting artefact can be observed: the
spread of the normalized cluster is significantly reduced. This is easily understood by referring
back to (8.11)-(8.12) and noticing that the normalization is performed frame-by-frame,
trying to make each normalized face as close as possible to the reference cluster’s mean, i.e.
a single point. For this reason, dissimilarity measures between probability densities common
in the literature, such as the Bhattacharyya distance, the Kullback-Leibler
divergence [Ara05b, Sha02a] or the Resistor-Average distance [Ara06e, Joh01], are not suitable
choices. Instead, we propose to use the simple Euclidean distance between normalized
cluster centres:


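$$D(C_1, C_2) = \left\| \frac{1}{N_1} \sum_{i=1}^{N_1} \tilde{x}_i^{(1)} - \frac{1}{N_2} \sum_{i=1}^{N_2} x_i^{(2)} \right\|_2 \qquad (8.19)$$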


Inter-manifold distance 


The last stage in the proposed method is the computation of an inter-manifold distance, 
or an inter-manifold dissimilarity measure, based on the distances between corresponding 
pose clusters. There are two main challenges in this problem: (i) depending on the poses 
assumed by the subjects, one or more clusters, and hence the corresponding distances, may 
be void; (ii) different poses are not equally important, or discriminative, in terms of face 
recognition [Sim04]. 

Writing d for the vector containing the three pose cluster distances, we want to classify a
novel appearance manifold to the gallery class giving the highest probability of corresponding
to it in identity, P(s|d). Using Bayes’ theorem:


$$P(s|d) = \frac{p(d|s) P(s)}{p(d)} \qquad (8.20)$$

$$= \frac{p(d|s) P(s)}{p(d|s) P(s) + p(d|\neg s) P(\neg s)} \qquad (8.21)$$

$$= \frac{1}{1 + p(d|\neg s) P(\neg s) / \left[ p(d|s) P(s) \right]} \qquad (8.22)$$


Assuming that the ratio of differing-identity to same-identity priors P(¬s)/P(s) is
constant across individuals, it is clear that classifying to the class with the highest P(s|d)
is equivalent to classifying to the class with the highest likelihood ratio:


$$\mu(d) = \frac{p(d|s)}{p(d|\neg s)} \qquad (8.23)$$

Learning pose likelihood ratios. Understanding that $d = [D_1, D_2, D_3]^T$, we assume
statistical independence between pose cluster distances:

$$p(d|s) = \prod_{i=1}^{3} p(D_i|s) \qquad (8.24)$$

$$p(d|\neg s) = \prod_{i=1}^{3} p(D_i|\neg s) \qquad (8.25)$$

We propose to learn the likelihood ratios $\mu(D_i) = p(D_i|s)/p(D_i|\neg s)$ offline, from a
small data corpus labelled by identity, in two stages. First, (i) we obtain a Parzen window
estimate of the densities of intra- and inter-personal pose distances by comparing all pairs
of training appearance manifolds; then (ii) we refine the estimates using a Radial Basis
Function (RBF) artificial neural network trained for each pose.
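
The classification rule implied by (8.23)-(8.25) can be sketched as follows. This is a minimal illustration only: the function names, the toy ratio function and the handling of void pose clusters (they are skipped, i.e. assumed to contribute no evidence) are assumptions made for the example and are not taken from the text.

```python
import math

def joint_log_ratio(distances, pose_ratios):
    """Sum of log likelihood ratios over the available pose clusters.

    distances    per-pose cluster distances D_i, with None marking a void cluster
    pose_ratios  per-pose callables returning p(D_i|s) / p(D_i|not s)
    """
    return sum(
        math.log(ratio(d))
        for d, ratio in zip(distances, pose_ratios)
        if d is not None  # assumption: a void pose cluster contributes no evidence
    )

def classify(query_distances, pose_ratios):
    """Return the gallery identity with the highest joint likelihood ratio."""
    return max(query_distances,
               key=lambda name: joint_log_ratio(query_distances[name], pose_ratios))

# Hypothetical, monotonically decreasing ratio shared by all three poses.
ratios = [lambda d: math.exp(2.0 - d)] * 3
# Hypothetical distances from a novel manifold to two gallery manifolds.
gallery = {"A": [1.0, None, 1.5], "B": [3.0, 2.5, 4.0]}
best = classify(gallery, ratios)
```

Working in the log domain turns the product of per-pose ratios into a sum, which avoids numerical underflow when individual ratios are very small.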

A Parzen window-based [Dud00] estimate of the likelihood ratio for the frontal head
orientation, obtained by directly comparing appearance manifolds as described in
Sections 8.2-8.5, is shown in Figure 8.10 (a). In the proposed method, this and the similar
likelihood ratio estimates for the other two head poses are not used directly for recognition
as they suffer from an important limitation: the estimates are ill-defined in domain regions
sparsely populated with training data. Specifically, an artefact caused by this problem can
be observed by noting that the likelihood ratios are not monotonically decreasing. What this
means is that more distant pose clusters can result in a higher chance of classifying two
sequences as originating from the same individual.
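
A Parzen window estimate of a single-pose likelihood ratio can be sketched as below. The Gaussian kernel, the bandwidth `h`, the small constant `eps` and the training distances are all assumptions chosen for illustration; the text does not specify them.

```python
import math

def parzen_density(x, samples, h):
    """Gaussian-kernel Parzen window estimate of a 1D density at x."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)

def likelihood_ratio(x, intra, inter, h, eps=1e-12):
    """Ratio p(D|s) / p(D|not s) from intra- and inter-personal distances."""
    # eps guards against division by zero where the inter-personal density vanishes
    return parzen_density(x, intra, h) / (parzen_density(x, inter, h) + eps)

# Hypothetical training distances for one pose cluster (not data from the thesis).
intra = [0.8, 1.0, 1.1, 1.3, 1.6, 2.0]
inter = [3.5, 4.2, 5.0, 5.1, 6.3, 7.0, 8.4]

ratio_near = likelihood_ratio(1.0, intra, inter, h=0.5)
ratio_far = likelihood_ratio(6.0, intra, inter, h=0.5)
```

Far from any training sample both densities are close to zero, so the ratio there is dominated by noise and by `eps`; this is the ill-defined behaviour in sparsely populated regions that the paragraph above describes.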

To overcome the problem of insufficient training data, we train a two-layer RBF-based
neural network for each of the discrete poses used in approximating face appearance
manifolds, see Figure 8.10 (c). In its basic form, this means that the likelihood ratio
estimate $\hat{\mu}(D_i)$ is given by the following expression:

$$\hat{\mu}(D_i) = \sum_j \alpha_j G(D_i; \mu_j, \sigma_j) \qquad (8.26)$$

where:

$$G(D_i; \mu_j, \sigma_j) = \frac{1}{\sigma_j \sqrt{2\pi}} \exp\left[ -\frac{(D_i - \mu_j)^2}{2\sigma_j^2} \right] \qquad (8.27)$$


In the proposed method, this is modified so as to enforce prior knowledge on the
functional form of the likelihood ratio, namely its monotonicity:


A* (A) = 



Oj Q{Di] fij, aj ), 
3 



(8.28) 


Finally, to ensure that the networks are trained using reliable data (in the context of
training sample density in the training domain), we use only local peaks of Parzen
window-based estimates. Results using six second-layer neurons, each with the spread of
$\sigma_j = 60$, see (8.28), are summarized in Figures 8.10 and 8.11.


8.6 Empirical evaluation 

Methods in this chapter were evaluated on the ToshFace data set. To establish baseline 
performance, we compared our recognition algorithm to: 

• Mutual Subspace Method (MSM) of Fukui and Yamaguchi [Fuk03],

• KL divergence-based algorithm of Shakhnarovich et al. (KLD) [Sha02a],

• Majority vote across all pairs of frames using Eigenfaces of Turk and Pentland [Tur91a].

In the KL divergence-based method we used principal subspaces that explain 85% of data
variation energy. In MSM we set the dimensionality of linear subspaces to 9 and used
the first 3 principal angles for recognition, as suggested by the authors in [Fuk03]. For the
Eigenfaces method, the 22-dimensional eigenspace used explained 90% of total training data
energy.

Offline training, i.e. learning of the pose-specific illumination subspaces and likelihood
ratios, was performed using 20 randomly chosen individuals in 5 illumination settings, for a
total of 100 sequences. These were used for neither gallery data nor test input in the
evaluation reported in this section.

Recognition performance of the proposed system was assessed by training it with the
remaining 40 individuals in a single illumination setting, and using the rest of the data as
test input. In all tests, both training data for each person in the gallery, as well as test data,
consisted of only a single sequence.

(a) Raw estimate

(b) RBF interpolated estimate

(c) RBF network architecture

Figure 8.10: Likelihood ratio corresponding to frontal head pose obtained from the training
corpus using Parzen windows (a) and the RBF network-based likelihood ratio (b). The
corresponding RBF network architecture is shown in (c). Note that the initial estimate (a)
is not monotonically decreasing, while (b) is.

Figure 8.11: Joint RBF network-based likelihood ratio for the frontal and left head
orientations.


Table 8.1: Recognition performance (%) of the proposed method using different illuminations 
for training and test input. Excellent results are demonstrated with little dependence of the 
recognition rate on the data acquisition conditions. 



         IL. 1   IL. 2   IL. 3   IL. 4   IL. 5   mean    std
IL. 1     100      90      95      95      90      94    4.2
IL. 2      95      95      95      95      90      94    2.2
IL. 3      95      95     100      95     100      97    2.7
IL. 4      95      90     100     100      95      96    4.2
IL. 5     100      80     100      95     100      95    8.7
mean       97      90      98      96      95    95.2    4.5


8.6.1 Results 

The performance of the proposed method is summarized in Table 8.1. We tabulated the 
recognition rates achieved across different combinations of illuminations used for training 
and test input, so as to illustrate its degree of sensitivity to the particular choice of data 
acquisition conditions. An average rate of 95% was achieved, with a mean standard deviation 
of only 4.7%. Therefore, we conclude that the proposed method is successful in recognition 
across illumination, pose and motion pattern variation, with high robustness to the exact 
imaging setup used to provide a set of gallery videos. 

This conclusion is further corroborated by Figure 8.12 (a), which shows cumulative
distributions of inter- and intra-personal manifold distances (see Section 8.5), and
Figure 8.12 (b), which plots the Receiver Operating Characteristic of the proposed algorithm.
Good class separation can be seen in both, illustrating the suitability of our method for
verification (one-against-one matching) applications: less than 0.5% false positive rate is
attained for 91.2% true positive rate. Additionally, it is important to note that good
separation is maintained across a wide range of distances, as can be seen in Figure 8.12 (a)
from low gradients of inter- and intra-class distributions e.g. on the interval between 1.0
and 15.0. This is significant as it implies that the interclass threshold choice is not very
numerically sensitive: by choosing a threshold in the middle of this range, we can expect the
recognition performance to generalize well to different data sets.

(a) (b)

Figure 8.12: Cumulative distributions of intra-personal (dashed line) and inter-personal
(solid line) distances (a). Good separability is demonstrated. The corresponding ROC curve
can be seen in (b) - less than 0.5% of false positive rate is attained for 91% true positive rate.
The corresponding distance threshold choice is numerically well-conditioned, as witnessed by
close-to-zero derivatives of the plots in (a) at the corresponding point.
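
The way a verification operating point is read off from intra- and inter-personal distances can be sketched as below. The function names and the toy distances are assumptions for illustration and do not reproduce the experimental data.

```python
def roc_point(intra, inter, threshold):
    """Return (true positive rate, false positive rate) when pairs with
    distance below `threshold` are accepted as the same identity."""
    tpr = sum(d < threshold for d in intra) / len(intra)
    fpr = sum(d < threshold for d in inter) / len(inter)
    return tpr, fpr

def best_threshold(intra, inter, max_fpr):
    """Pick the candidate threshold with the highest TPR whose FPR <= max_fpr."""
    best = (0.0, 0.0, min(intra + inter))  # (tpr, fpr, threshold): accept nothing
    for t in sorted(set(intra + inter)):
        tpr, fpr = roc_point(intra, inter, t)
        if fpr <= max_fpr and tpr > best[0]:
            best = (tpr, fpr, t)
    return best

# Hypothetical manifold distances with a wide gap between the two classes.
intra = [0.5, 0.7, 0.9, 1.2, 20.0]
inter = [16.0, 18.0, 22.0, 25.0, 30.0]
operating_point = best_threshold(intra, inter, max_fpr=0.0)
```

In this toy example any threshold between 1.2 and 16.0 yields the same rates, which mirrors the observation that a flat region of the cumulative distributions makes the threshold choice insensitive.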

Pose clusters 

One of the main premises that this work rests on is the idea that illumination and pose
robustness in recognition can be achieved by decomposing an appearance manifold into
a set of pose ranges (see Section 8.3.1) which are, after being processed independently,
probabilistically combined (see Section 8.5). We investigated the discriminating power of
each of the three pose clusters used in the proposed context by performing recognition using
the inter-cluster distance defined in Section 8.5. Table 8.2 shows a summary of the results.
High recognition rates were achieved even using only a single pose cluster. Furthermore,
the proposed method for integrating cluster distances into a single inter-manifold distance
can be seen to improve on the average performance of the most discriminative pose. In the
described recognition framework, side poses contributed more discriminative information
to the distance than the frontal pose (in spite of a lower average number of side faces per
sequence, see Figure 8.4 in Section 8.2), as witnessed by both a higher average recognition
accuracy and a lower standard deviation of recognition. It is interesting to observe that this
is in agreement with the finding that appearance in a roughly semi-profile head pose is
inherently most discriminative for recognition [Sim04].

Table 8.2: A comparison of identification statistics for recognition using each of the
pose-specific cluster distances separately and the proposed method for combining them using an
RBF-based neural network. In addition to the expected performance improvement when using
all over only some poses, it is interesting to note different contributions of side and frontal
pose clusters, the former being more discriminative in the context of the proposed method.

Measure   Manifold distance   Front clusters distance   Side clusters distance
mean             95                     90                        93
std              4.7                    5.7                       3.6

Table 8.3: Average recognition rates (%) of the compared methods across different
illumination conditions used for training and test. The performance of the proposed method is
by far the best, both in terms of the average recognition rate and its variance.

Method   Proposed method   Majority vote, Eigenfaces   KLD    MSM
mean           95                     43               39     24
std            4.7                    31.9             32.5   38.9

Other algorithms 

The result of the comparison with the other evaluated methods is shown in Table 8.3. The
proposed algorithm outperformed the others by a significant margin. Majority vote using
Eigenfaces and the KL divergence algorithm performed with a statistically insignificant
difference, while MSM showed the least robustness to the extreme changes in illumination
conditions. It is interesting to note that all three algorithms achieved perfect recognition
when training and test sequences were acquired in the same illumination conditions.
Considering the simplicity and computational efficiency of these methods, investigation of
their behaviour when used on preprocessed data (e.g. high-pass filtered images [Ara05c, Fit02]
or self-quotient images [Wan04a]) appears to be a promising research direction.

Failure modes 

Finally, we investigated the main failure modes of our algorithm. An inspection of failed
recognitions suggests that the largest difficulty was caused by significant user motion to and
from the camera. During the data acquisition, for some of the illumination conditions the
dominant light sources were relatively close to the user (from ≈ 0.5m). This invalidated
the implicit assumption that illumination conditions were unchanging within a single video
sequence, i.e. that the main cause of appearance changes in images was head rotation.

Another limitation of the method was observed in cases when only a few faces were
clustered to a particular pose, either because of facial feature detection failure or because
the user did not spend enough time in a certain range of head poses. The noisy estimate of the
corresponding cluster density in (8.16) propagated the estimation error to illumination
normalized images and finally to the overall manifold distance, reducing the separation
between classes.


8.7 Summary and conclusions 

In this chapter we introduced a novel algorithm for face recognition from video, robust to
changes in illumination, pose and the motion pattern of the user. This was achieved by
combining person-specific face motion appearance manifolds with generic pose-specific
illumination manifolds, which were assumed to be linear. Integrated into a fully automatic
practical system, the method has demonstrated a high recognition rate in realistic,
uncontrolled data acquisition conditions.


Related publications 

The following publications resulted from the work presented in this chapter: 

• O. Arandjelovic and R. Cipolla. An illumination invariant face recognition system
for access control using video. In Proc. IAPR British Machine Vision Conference
(BMVC), pages 537-546, September 2004. [Ara04c]

In the previous chapter it was shown how a priori domain-specific knowledge can be
combined with data-driven learning to reliably recognize faces in the presence of illumination,
pose and motion pattern variations. The main limitations of the proposed method are:
(i) the assumption of linearity of pose-specific illumination subspaces, (ii) the coarse pose- 
based fusion of discriminative information from different frames, and (iii) the appearance 
distribution artifacts introduced during pose-specific illumination normalization. 

This chapter finalizes the part of the thesis that deals with robustly comparing two 
face motion sequences. We describe the Generic Shape-Illumination Manifold recognition 
algorithm that in a principled manner handles all of the aforementioned limitations. 

In particular there are three areas of novelty: (i) we show how a photometric model 
of image formation can be combined with a statistical model of generic face appearance 
variation to generalize in the presence of extreme illumination changes; (ii) we use the 
smoothness of geodesically local appearance manifold structure and a robust same-identity 
likelihood to achieve robustness to unseen head poses; and (iii) we introduce a precise video 
sequence “reillumination” algorithm to achieve robustness to face motion patterns in video. 

The proposed algorithm consistently demonstrated a nearly perfect recognition rate (over
99.5% on CamFace, ToshFace and Face Video data sets), significantly outperforming
state-of-the-art commercial software and methods from the literature.


9.1 Synthetic reillumination of face motion manifolds 

One of the key ideas of this chapter is the algorithm for reillumination of video sequences.
Our goal is to take two input sequences of faces and produce a third, synthetic one, that
contains the same poses as the first in the illumination of the second one. For the proposed
method, the crucial properties are the (i) continuity and (ii) smoothness of face motion
manifolds, see Figure 9.1.

The proposed method consists of two stages. First, each face from the first sequence is
matched with the face from the second that corresponds to it best in terms of pose. Then,
a number of faces close to the matched one are used to finely reconstruct the reilluminated
version of the original face. Our algorithm is therefore global, unlike most of the previous
methods which use a sparse set of detected salient points for registration, e.g. [Ara05c,
Ber04, Fuk03]. We have found these to fail on our data set due to the severity of illumination
conditions (see Section B.2). The two stages of the proposed algorithm are next described
in detail.

(a) Face Motion Manifold (FMM) (b) Shape-Illumination Manifold

Figure 9.1: Manifolds of (a) face appearance and (b) albedo-free appearance i.e. the effects
of illumination and pose changes, in a single motion sequence. Shown are projections to the
first 3 linear principal components, with a typical manifold sample on the top-right.


9.1.1 Stage 1: pose matching 

Let $\{X_i^{(1)}\}$ and $\{X_i^{(2)}\}$ be two motion sequences of a person’s face in two
different illuminations. Then, for each $X_i^{(1)}$ we are interested in finding
$X_{c(i)}^{(2)}$ that corresponds to it best in terms of head pose. Finding the unknown
mapping c on a frame-by-frame basis is difficult in the presence of extreme illumination
changes and when face images are of low resolution. Instead, we exploit the face manifold
smoothness by formulating the problem as a minimization task with the fitness function
taking on the form:

$$f(c) = f_{match}(c) + \omega f_{reg}(c) \qquad (9.1)$$

$$= \underbrace{\sum_i d_E\left(X_i^{(1)}, X_{c(i)}^{(2)}\right)}_{\text{Matching term}} + \omega \underbrace{\sum_i \sum_j \frac{d_G^{(2)}\left(X_{c(i)}^{(2)}, X_{c(n(i,j))}^{(2)}\right)}{d_G^{(1)}\left(X_i^{(1)}, X_{n(i,j)}^{(1)}\right)}}_{\text{Regularization term}} \qquad (9.2)$$

where n(i,j) is the j-th of K nearest neighbours of face i, $d_E$ a pose dissimilarity function
and $d_G^{(k)}$ a geodesic distance estimate along the FMM of sequence k. The first term is
easily understood as a penalty for dissimilarity of matched pose-signatures. The latter is
a regularizing term that enforces a globally good matching by favouring mappings that
map geodesically close points from the domain manifold to geodesically close points on the
codomain manifold.
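
Evaluating the fitness (9.1)-(9.2) for a candidate mapping can be sketched as follows. The function signature, the precomputed distance matrices and the one-dimensional toy "poses" are assumptions made for illustration.

```python
def fitness(c, d_pose, d_geo1, d_geo2, neighbours, omega):
    """Evaluate f(c) = f_match(c) + omega * f_reg(c) for a candidate mapping.

    c[i]           index of the frame in sequence 2 matched to frame i of sequence 1
    d_pose[i][m]   pose dissimilarity between frame i (seq. 1) and frame m (seq. 2)
    d_geo1, d_geo2 geodesic distance matrices within sequence 1 and sequence 2
    neighbours[i]  indices of the nearest neighbours of frame i in sequence 1
    """
    f_match = sum(d_pose[i][c[i]] for i in range(len(c)))
    f_reg = sum(
        d_geo2[c[i]][c[j]] / d_geo1[i][j]
        for i in range(len(c))
        for j in neighbours[i]
    )
    return f_match + omega * f_reg

# Toy example: three frames per sequence, with pose reduced to a single number.
poses1, poses2 = [0.0, 1.0, 2.0], [0.0, 1.0, 2.0]
d_pose = [[abs(p - q) for q in poses2] for p in poses1]
d_geo = [[abs(p - q) for q in poses1] for p in poses1]
neighbours = [[1], [0, 2], [1]]

f_identity = fitness([0, 1, 2], d_pose, d_geo, d_geo, neighbours, omega=0.5)
f_scrambled = fitness([2, 0, 1], d_pose, d_geo, d_geo, neighbours, omega=0.5)
```

The pose-preserving mapping attains the lower value because it incurs no matching penalty and keeps neighbouring frames geodesically close on the codomain manifold.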


Regularization. The manifold-oriented nature of the regularizing function $f_{reg}(c)$ in (9.2)
has significant advantages over alternatives that use some form of temporal smoothing. Firstly,
it is unaffected by changes in the motion pattern of the user (i.e. sequential ordering of
faces). On top of the inherent benefit (a person’s motion should not affect recognition),
this is important for several practical reasons, e.g.

• face images need not originate from a single sequence - multiple sequences are easily
combined together by computing the union of their frame sets, and

• regularization works even if there are bursts of missed or incorrect face detections (see
Section B.2).

Figure 9.2: Manifold-to-manifold pose matching: geodesic distances between neighbouring
faces on the domain manifold and the corresponding faces on the codomain manifold are
used to regularize the solution.

(a) Original

(b) Reilluminated

Figure 9.3: (a) Original images from a novel video sequence and (b) the result of
reillumination using the proposed genetic algorithm with nearest neighbour-based reconstruction.

To understand the form of the regularizing function note that the mapping function c
only affects the numerator of each summation term in $f_{reg}(c)$. Its effect is then to penalize
cases in which neighbouring faces of the domain manifold map to geodesically distant faces
on the codomain manifold. The penalty is further weighted by the inverse of the original
geodesic distance $d_G^{(1)}\left(X_i^{(1)}, X_{n(i,j)}^{(1)}\right)$ to place more emphasis on
local pose agreement.

Pose-matching function. The performance of function $d_E$ in (9.2) at estimating the
goodness of a frame match is crucial for making the overall optimization scheme work well.
Our approach consists of filtering the original face image to produce a quasi
illumination-invariant pose-signature, which is then compared with other pose-signatures
using the Euclidean distance:

$$d_E\left(X_i^{(1)}, X_{c(i)}^{(2)}\right) = \left\| \hat{X}_i^{(1)} - \hat{X}_{c(i)}^{(2)} \right\|_2 \qquad (9.3)$$

where $\hat{X}$ denotes the pose-signature of face X.

Note that the signatures are only used for frame matching and thus need not retain any 
power of discrimination between individuals - all that is needed is sufficient pose information. 
We use a distance-transformed edge map of the face image as a pose-signature, motivated 
by the success of this representation in object-configuration matching across other computer 
vision applications, e.g. [Gav00, Ste03].


Minimizing the fitness function. Exact minimization of the fitness function (9.2) over
all functions c is an NP-complete problem. However, since the final synthesis of novel faces
(Stage 2) involves an entire geodesic neighbourhood of the paired faces, it is inherently robust
to some non-optimality of this matching. Therefore, in practice, it is sufficient to find a
good match, not necessarily the optimal one.

We propose to use a genetic algorithm (GA) [Dud00] as a particularly suitable approach
to minimization for our problem. GAs rely on the property of many optimization problems
that sub-solutions of good solutions are good themselves. Specifically, this means that if we
have a globally good manifold match, then local matching can be expected to be good too.
Hence, combining two good matches is a reasonable attempt at improving the solution. This
motivates the chromosome structure we use, depicted in Figure 9.4 (b), with the i-th gene
in a chromosome being the value of c(i). GA parameters were determined experimentally
from a small training set and are summarized in Figure 9.4 (a).

Property              Value
Population size       20
Elite survival no.    2
Mutation (%)          5
Migration (%)         20
Crossover (%)         80
Max. generations      200

(a)

(b) (c)

Figure 9.4: (a) The parameters of the proposed GA optimization, (b) the corresponding
chromosome structure and (c) the population fitness (see (9.2)) in a typical evolution. Maximal
generation count of 200 was chosen as a trade-off between accuracy and matching speed.


Estimating geodesic distances. The definition of the fitness function in (9.2) involves
estimates of geodesic distances along manifolds. Due to the nonlinearity of FMMs [Ara05b,
Lee03] the geodesic distance is not well approximated by the Euclidean distance. We estimate
the geodesic distance between every two faces from a manifold using Floyd’s algorithm [Cor90]
on a constructed undirected graph whose nodes correspond to face images (also see [Ten00]).
Then, if $X_i$ is one of the K nearest neighbours of $X_j$ ¹:

$$d_G(X_i, X_j) = \|X_i - X_j\|_2. \qquad (9.4)$$

Otherwise:

$$d_G(X_i, X_j) = \min_k \left[ d_G(X_i, X_k) + d_G(X_k, X_j) \right]. \qquad (9.5)$$

¹ Note that the converse does not hold as $X_i$ being one of the K nearest neighbours of $X_j$
does not imply that $X_j$ is one of the K nearest neighbours of $X_i$. Therefore the edge
relation of this graph is a superset of the “in K-nearest neighbours” relation on Xs.
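
The geodesic distance estimate of (9.4)-(9.5) can be sketched with a plain implementation of Floyd's all-pairs shortest path algorithm on a K-nearest-neighbour graph. The function name and the semicircle toy data are assumptions for illustration; real inputs would be vectorized face images.

```python
import math

def geodesic_distances(points, k):
    """Estimate pairwise geodesic distances on a K-nearest-neighbour graph."""
    n = len(points)
    euclid = [[math.dist(p, q) for q in points] for p in points]
    dist = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    # Connect each point to its K nearest neighbours, cf. (9.4); the graph is
    # undirected, so every edge is stored in both directions.
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: euclid[i][j])[:k]
        for j in nearest:
            dist[i][j] = dist[j][i] = euclid[i][j]
    # Floyd's all-pairs shortest paths, cf. (9.5).
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][m] + dist[m][j] < dist[i][j]:
                    dist[i][j] = dist[i][m] + dist[m][j]
    return dist

# Five points on a unit semicircle, a toy stand-in for faces on a curved manifold.
points = [(math.cos(math.pi * t / 4), math.sin(math.pi * t / 4)) for t in range(5)]
geo = geodesic_distances(points, k=2)
```

For the two end points of the semicircle the graph distance exceeds the straight-line distance, illustrating why the Euclidean distance underestimates separation along a nonlinear manifold. The cubic running time of Floyd's algorithm is acceptable only for sequences of moderate length.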

9.1.2 Stage 2: fine reillumination

Having computed a pose-matching function c*, we turn to the problem of reilluminating
frames $X_i^{(1)}$. We exploit the smoothness of pose-signature manifolds (which was ensured
by distance-transforming face edge maps), illustrated in Figure 9.5, by computing the
reilluminated frame as a linear combination of K nearest-neighbour frames of
$X_{c^*(i)}^{(2)}$. Linear combining coefficients $\alpha_1, \ldots, \alpha_K$ are found from
the corresponding pose-signatures by solving the following constrained minimization problem:

$$\{\alpha_j\} = \arg\min_{\{\alpha_j\}} \left\| \hat{X}_i^{(1)} - \sum_{k=1}^{K} \alpha_k \hat{X}_{n(c^*(i),k)}^{(2)} \right\|_2 \qquad (9.6)$$

subject to $\sum_{k=1}^{K} \alpha_k = 1.0$, where $\hat{X}_i^{(j)}$ is the pose-signature
corresponding to $X_i^{(j)}$. In other words, the pose-signature of a novel face is first
reconstructed using the pose-signatures of K training faces (in the target illumination),
which are then combined in the same fashion to synthesize a reilluminated face, see
Figures 9.3 and 9.6. We restrict the set of frames used for reillumination to the K-nearest
neighbours for two reasons. Firstly, the computational time of using all faces would make
this highly impractical. Secondly, the nonlinearity of both face appearance manifolds and
pose-signature manifolds demands that only the faces in the local, Euclidean-like
neighbourhood are used.

Optimization of (9.6) is readily performed by differentiation giving: 


where: 


Q!2 

Q!3 

pK, 
n(c’

b)- 


(9.7) 


(9.8) 

(9.9) 
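
One way to solve the sum-to-one constrained least-squares problem (9.6) numerically is sketched below. It does not reproduce the closed form of (9.7)-(9.9); instead it solves the equivalent Lagrangian (KKT) linear system by Gaussian elimination. All function names and the toy signatures and faces are assumptions for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def combining_coefficients(target_sig, neighbour_sigs):
    """Sum-to-one least-squares weights for reconstructing target_sig."""
    k = len(neighbour_sigs)
    # KKT system [G 1; 1^T 0] [alpha; nu] = [b; 1], with G the Gram matrix of the
    # neighbour signatures and b their inner products with the target signature.
    kkt = [[dot(si, sj) for sj in neighbour_sigs] + [1.0] for si in neighbour_sigs]
    kkt.append([1.0] * k + [0.0])
    rhs = [dot(si, target_sig) for si in neighbour_sigs] + [1.0]
    return solve(kkt, rhs)[:k]

def reilluminate(alphas, neighbour_faces):
    """Apply the same weights to the appearance images of the neighbours."""
    return [sum(a * f[p] for a, f in zip(alphas, neighbour_faces))
            for p in range(len(neighbour_faces[0]))]

# Toy data: three neighbour pose-signatures and their two-pixel appearance images.
sigs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
faces = [[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]]
alphas = combining_coefficients([0.2, 0.3, 0.5], sigs)
synthetic = reilluminate(alphas, faces)
```

The sketch assumes the KKT matrix is non-singular, which holds when the neighbour signatures are affinely independent; a practical implementation would need to regularize or reduce K otherwise.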


Figure 9.5: A face motion manifold in the input image space and the corresponding
pose-signature manifold (both shown in their respective 3D principal subspaces). Much like the
original appearance manifold, the latter is continuous and smooth, as ensured by distance
transforming the face edge maps. While not necessarily similar globally, the two manifolds
retain the same local structure, which is crucial for the proposed fine reillumination algorithm.

Figure 9.6: Face reillumination: the coefficients for linearly combining face appearance
images (bottom row) are computed using the corresponding pose-signatures (top row). Also see
Figure 9.5.


9.2 The shape-illumination manifold


In most practical applications, specularities, multiple or non-point light sources significantly
affect the appearance of faces. We believe that the difficulty of dealing with these effects is
one of the main reasons for poor performance of most face recognition systems when put to
use in a realistic environment. In this work we make a very weak assumption on the process
of image formation: the only assumption made is that the intensity of each pixel is a linear
function of the albedo a(j) of the corresponding 3D point:

$$X(j) = a(j) \cdot s(j) \qquad (9.10)$$

where s is a function of illumination, shape and other parameters not modelled explicitly.
This is similar to the reflectance-lighting model used in Retinex-based algorithms [Kim03],
the main difference being that we make no further assumptions on the functional form of
s. Note that the commonly-used (e.g. see [Bla03, Geo98, RR01]) Lambertian reflectance
model is a special case of (9.10) [Bel98]:

$$s(j) = \sum_i \max(n_j \cdot L_i, 0) \qquad (9.11)$$

where $n_j$ is the corresponding surface normal and $\{L_i\}$ the intensity-scaled illumination
directions at the point.

The image formation model introduced in (9.10) leaves the image pixel intensity as an
unspecified function of face shape or illumination parameters. Instead of formulating a
complex model of the geometry and photometry behind this function (and then needing to
recover a large number of model parameters), we propose to learn it implicitly. Consider
two images, $X_1$ and $X_2$, of the same person, in the same pose, but different illuminations.
Then from (9.10):

$$\Delta \log X(j) = \log s_2(j) - \log s_1(j) \equiv d_s(j) \qquad (9.12)$$

In other words, the difference between these logarithm-transformed images is not a function
of face albedo. As before, due to the smoothness of faces, as the pose of the subject varies
the difference-of-logs vector $d_s$ describes a manifold in the corresponding embedding vector
space. This is the Shape-Illumination manifold (SIM) corresponding to a particular pair
of video sequences, refer back to Figure 9.1 (b).
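
The albedo cancellation in (9.12) can be checked numerically with a small sketch. The helper names and the three-pixel values are assumptions for illustration; they only instantiate the multiplicative model (9.10).

```python
import math

def render(albedo, shading):
    """Image formation of (9.10): pixel intensity = albedo * s."""
    return [a * s for a, s in zip(albedo, shading)]

def log_difference(image_a, image_b):
    """Per-pixel difference of log-intensities, log(image_b) - log(image_a)."""
    return [math.log(b) - math.log(a) for a, b in zip(image_a, image_b)]

# Hypothetical three-pixel example: two people share the shape/illumination
# terms s1 and s2 but have different albedos.
s1, s2 = [0.9, 0.5, 0.2], [0.3, 0.6, 0.8]
albedo_p, albedo_q = [0.7, 0.4, 0.9], [0.2, 0.8, 0.5]

d_p = log_difference(render(albedo_p, s1), render(albedo_p, s2))
d_q = log_difference(render(albedo_q, s1), render(albedo_q, s2))
expected = [math.log(b) - math.log(a) for a, b in zip(s1, s2)]
```

Both `d_p` and `d_q` agree with `expected` up to rounding, since log(a s_2) - log(a s_1) = log s_2 - log s_1 for any positive albedo a.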


The generic SIM. A crucial assumption of our work is that the Shape-Illumination
Manifold of all possible illuminations and head poses is generic for human faces (gSIM).
This is motivated by a number of independent results reported in the literature that have
shown face shape to be less discriminating than albedo across different models [Cra99,
Gro04] or have reported good results in synthetic reillumination of faces using the
constant-shape assumption [RR01]. In the context of face manifolds this means that the
effects of illumination and shape can be learnt offline from a training corpus containing
typical modes of pose and illumination variation.

It is worth emphasizing the key difference in the proposed offline learning from previous
approaches in the literature which try to learn the albedo of human faces. Since offline
training is performed on persons not in the online gallery, in the case when albedo is learnt
it is necessary to have means of generalization i.e. learning what possible albedos human
faces can have from a small subset. In [RR01], for example, the authors demonstrate
generalization to albedos in the rational span of those in the offline training set. This
approach is not only unintuitive, but also without a meaningful theoretical justification. On
the other hand, previous research indicates that illumination effects can be learnt directly
without the need for generalization [Ara05b].

Training data organization. The proposed method consists of two training stages - a 
one-time offline learning performed using offline training data and a stage when gallery data 
of known individuals with associated identities is collected. The former (explained next) 
is used for learning the generic face shape contribution to face appearance under varying 
illumination, while the latter is used for subject-specific learning. 

9.2.1 Offline stage: learning the generic SIM (gSIM) 

Let $X_i^{(j,k)}$ be the i-th face of the j-th person in the k-th illumination, same indexes
corresponding in pose, as ensured by the proposed reillumination algorithm in Section 9.1.
Then from (9.12), samples from the generic Shape-Illumination manifold can be computed
by logarithm-transforming all images and subtracting those corresponding in identity and
pose:

$$d = \log X_i^{(j,p)} - \log X_i^{(j,q)} \qquad (9.13)$$

Provided that training data contains typical variations in pose and illumination (i.e. that
the p.d.f. confined to the generic SIM is well sampled), this becomes a standard
statistical problem of high-dimensional density estimation. We employ the Gaussian Mixture
Model (GMM). In the proposed framework, this representation is motivated by: (i) the
assumed low-dimensional manifold model (3.1), (ii) its compactness and (iii) the existence
of incremental model parameter estimation algorithms (e.g. [Ara05a, Hal00]).

Briefly, we estimate multivariate Gaussian components using the Expectation
Maximization (EM) algorithm [Dud00], initialized by k-means clustering. Automatic model
order selection is performed using the well-known Minimum Description Length criterion
[Dud00] while the principal subspace dimensionality of PPCA components was estimated
from eigenspectra of covariance matrices of a diagonal GMM fit, performed first. Fitting
was then repeated using a PPCA mixture. From 6123 gSIM samples computed from 100
video sequences, we obtained 12 mixture components, each with a 6D principal subspace.
Figure 9.7 shows an example of subtle illumination effects learnt with this model.

Figure 9.7: Learning complex illumination effects: Shown is the variation along the 1st mode
of a single PPCA space in our SIM mixture model. Cast shadows (e.g. from the nose) and
the locations of specularities (on the nose and above the eyes) are learnt as the illumination
source moves from directly overhead to side-overhead.


9.3 Novel sequence classification 

The discussion so far has concentrated on offline training and building an illumination model 
for faces - the Generic Shape-Illumination manifold. Central to the proposed algorithm was 
a method for reilluminating a face motion sequence of a person with another sequence of the 
same person (see Section 9.1). We now show how the same method can be used to compute 
a similarity between two unknown individuals, given a single training sequence for each and 
the Generic SIM. 

Let gallery data consist of sequences {X_i}^{(1)}, ..., {X_i}^{(N)}, corresponding to N
individuals, {X_i}^{(0)} be a novel sequence of one of these individuals and G(x; Θ) a
Mixture of Probabilistic PCA corresponding to the generic SIM. Using the reillumination
algorithm of Section 9.1, the novel sequence can be reilluminated with each {X_i}^{(j)} from
the gallery, producing samples {d_i}^{(j)}. We assume these to be identically and
independently distributed according to a density corresponding to a postulated
subject-specific SIM. We then compute the probability of these under G(x; Θ):

    p_i^{(j)} = G(d_i^{(j)}; Θ)    (9.14)

When {X_i}^{(0)} and {X_i}^{(j)} correspond in identity, from the way the Generic SIM is
learnt, it can be seen that the probabilities p_i^{(j)} will be large. The more interesting
question arises when the two compared sequences do not correspond to the same person. In
this case, the reillumination algorithm will typically fail to produce a meaningful result - the
output frames will not correspond in pose to the target sequence, see Figure 9.8.
Consequently, the observed appearance difference will have a low probability under the
hypothesis that it is caused purely by an illumination change. A similar result is obtained if
the two individuals share sufficiently similar facial lines and poses are correctly matched. In
this case it is the differences in face surface albedo that are not explained well by the Generic
SIM, producing low p_i^{(j)} in (9.14).

Varying pose and robust likelihood. Instead of basing the classification of {X_i}^{(0)} on
the likelihood of observing the entire set {d_i}^{(j)} in (9.14), we propose a more robust
measure. To appreciate the need for robustness, consider the histograms in Figure 9.9 (a). It
can be observed that the likelihood of the most similar faces in an inter-personal comparison,
in terms of (9.14), approaches that of the most dissimilar faces in an intra-personal
comparison (sometimes even exceeding it). This occurs when the correct gallery sequence
contains poses that are very dissimilar to even the most similar ones in the novel sequence, or
vice versa (note that small dissimilarities are extrapolated well from local manifold structure
using (9.6)). In our method, the robustness to these unseen modes of pose variation is
achieved by considering the mean log-likelihood of only the most likely faces. In our
experiments we used the top 15% of the faces, but we found the algorithm to exhibit little
sensitivity to the exact choice of this number, see Figure 9.9 (b). A summary of the proposed
algorithms is shown in Figures 9.10 and 9.11.
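The robust measure can be sketched in a few lines. This is a minimal illustration, assuming the per-face log-likelihoods have already been computed under the generic SIM model. The function name robust_similarity and the rounding rule for the number of retained faces are assumptions made for the example.

```python
import math


def robust_similarity(log_likelihoods, top_fraction=0.15):
    """Mean log-likelihood of the most likely faces only.

    log_likelihoods holds one log-likelihood per reilluminated face.
    top_fraction is the share of faces kept (15% in the experiments above).
    """
    if not log_likelihoods:
        raise ValueError("at least one face is required")
    if not 0.0 < top_fraction <= 1.0:
        raise ValueError("top_fraction must lie in (0, 1]")
    ranked = sorted(log_likelihoods, reverse=True)
    # Assumption: round the number of kept faces up, and keep at least one.
    n_keep = max(1, math.ceil(top_fraction * len(ranked)))
    return sum(ranked[:n_keep]) / n_keep


if __name__ == "__main__":
    # Three well-explained faces and many poorly matched poses.
    faces = [-1.0] * 3 + [-100.0] * 17
    print(robust_similarity(faces))        # uses the best 3 of 20 faces
    print(robust_similarity(faces, 1.0))   # plain mean over all faces
```

With top_fraction set to 1.0 the function reduces to the plain mean log-likelihood, which is the non-robust variant evaluated in Section 9.4.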


9.4 Empirical evaluation 

We compared the performance of our recognition algorithm with and without the robust 
likelihood of Section 9.3 (i.e. using only the most reliable vs. all detected and reilluminated 
faces) on CamFace, ToshFace and Face Video data sets to that of: 

• State-of-the-art commercial system FaceIt® by Identix [Ide03] (the best performing
software in the most recent Face Recognition Vendor Test [Phi03]),

• Constrained MSM (CMSM) [Fuk03] used in Toshiba’s state-of-the-art commercial system
FacePass® [Tos06]^,


^The algorithm was reimplemented through consultation with the authors.


Figure 9.8: An example of “reillumination” results when the two compared sequences do not
correspond to the same individual: the target sequence is shown on the left, the output of
our algorithm on the right. Most of the frames do not contain faces which correspond in
pose.


(a) Histograms (b) Recognition


Figure 9.9: (a) Histograms of intra-personal likelihoods across frames of a sequence when the
two compared sequences correspond to the same (red) and different (blue) people. (b)
Recognition rate as a function of the number of frames deemed ‘reliable’.


Input: database of sequences {X_i}^{(j)}.
Output: model of gSIM G(d; Θ).

1: gSIM iteration
     for all persons i and illuminations j, k
2: Reilluminate using {X_i}^{(k)}
     {Y_i}^{(j)} = reilluminate({X_i}^{(j)}; {X_i}^{(k)})
3: Add gSIM samples
     D = D ∪ { Y_i^{(j)} - X_i^{(j)} : i = 1 ... }
4: Fit GMM G from gSIM samples
     G(d; Θ) = EM_GMM(D)


Figure 9.10: A summary of the proposed offline learning algorithm. Illumination effects on
the appearance of faces are learnt as a probability density, in the proposed method
approximated with a Gaussian mixture G(d; Θ).


Input: sequences {X_i}^{(1)}, {X_i}^{(2)}.
Output: same-identity likelihood ρ.

1: Reilluminate using {X_i}^{(2)}
     {Y_i}^{(1)} = reilluminate({X_i}^{(1)}; {X_i}^{(2)})
2: Postulate SIM samples
     d_i = log X_i^{(1)} - log Y_i^{(1)}
3: Compute likelihoods of {d_i}
     p_i = G(d_i; Θ)
4: Order {d_i} by likelihood
     p_{s(1)} ≥ ... ≥ p_{s(N)} ≥ ...
5: Inter-manifold similarity ρ
     ρ = Σ_{i=1}^{N} log p_{s(i)} / N


Figure 9.11: A summary of the proposed online recognition algorithm.

• Mutual Subspace Method (MSM) [Fuk03, Mae04]^,

• Kernel Principal Angles (KPA) of Wolf and Shashua [Wol03]^, and 

• KL divergence-based algorithm of Shakhnarovich et al. (KLD) [Sha02a]^. 


In all tests, both training data for each person in the gallery, as well as test data, consisted 
of only a single sequence. Offline training of the proposed algorithm was performed using 
20 individuals in 5 illuminations from the CamFace data set; we emphasize that these
were not used as test input for the evaluations reported in this section. The methods were 
evaluated using 3 face representations: 

• raw appearance images X,

• Gaussian high-pass filtered images - used for face recognition in [Ara05c, Fit02]:

    X_H = X - (X * G_{\sigma=1.5}),    (9.15)

• local intensity-normalized high-pass filtered images - similar to the Self Quotient Image
[Wan04a] (also see [Ara06f]):

    X_Q = X_H ./ (X - X_H),    (9.16)

the division being element-wise.
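The two filtered representations in (9.15) and (9.16) can be sketched as follows. This is a minimal pure-Python illustration that treats an image as a list of rows of floats. The kernel truncation at three standard deviations, the replicated borders and the small constant eps that guards the division are assumptions of the sketch, and the function names are hypothetical.

```python
import math


def gaussian_kernel_1d(sigma, radius=None):
    """Normalized 1D Gaussian kernel, truncated at about 3 sigma by default."""
    if radius is None:
        radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def _convolve_rows(image, kernel):
    """Convolve every row with the kernel, replicating border pixels."""
    radius = len(kernel) // 2
    width = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = min(max(x + k - radius, 0), width - 1)
                acc += w * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out


def _transpose(image):
    return [list(column) for column in zip(*image)]


def gaussian_blur(image, sigma=1.5):
    """Separable Gaussian smoothing: rows first, then columns."""
    kernel = gaussian_kernel_1d(sigma)
    rows_done = _convolve_rows(image, kernel)
    return _transpose(_convolve_rows(_transpose(rows_done), kernel))


def high_pass(image, sigma=1.5):
    """X_H = X - (X * G_sigma), as in (9.15)."""
    low = gaussian_blur(image, sigma)
    return [[p - l for p, l in zip(row, low_row)]
            for row, low_row in zip(image, low)]


def self_quotient(image, sigma=1.5, eps=1e-6):
    """X_Q = X_H ./ (X - X_H), as in (9.16); X - X_H is the smoothed image."""
    low = gaussian_blur(image, sigma)
    return [[(p - l) / (l + eps) for p, l in zip(row, low_row)]
            for row, low_row in zip(image, low)]
```

Because X - X_H equals the smoothed image, the quotient divides the high-pass response by the local mean intensity, which is what makes the representation locally intensity-normalized.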

Background clutter was suppressed using a weighting mask M_F, produced by feathering
the mean face outline M in a manner similar to [Ara05c] and as shown in Figure 9.12:

    M_F = M * exp -{ r^2(x, y) / … }    (9.17)


9.4.1 Results 

A summary of experimental results is shown in Table 9.1. The proposed algorithm greatly
outperformed other methods, achieving a nearly perfect recognition (99.3+%) on all 3
databases. This is an extremely high recognition rate for such unconstrained conditions
(see Figure 2.16), small amount of training data per gallery individual and the degree of
illumination, pose and motion pattern variation between different sequences. This is
witnessed by the performance of the Simple KLD method, which can be considered a proxy
for gauging the difficulty of the task, seeing that it is expected to perform well if imaging
conditions are not greatly different between training and test [Sha02a]. Additionally, it is
important to note the excellent performance of our algorithm on the Japanese database,
even though offline training was performed using Caucasian individuals only.

^We used the original authors’ implementation.


(a) Mask (b) Representations


Figure 9.12: (a) The weighting mask used to suppress background clutter. (b) The three face
representations used in evaluation, shown as images, before (top row) and after (bottom
row) the weighting mask was applied.

As expected, when plain likelihood was used instead of the robust version proposed
in Section 9.3, the recognition rate was lower, but still significantly higher than that of
other methods. The high performance of non-robust gSIM is important as an estimate
of the expected recognition rate in the “still-to-video” scenario of the proposed method.
We conclude that our algorithm’s performance seems very promising in this setup as well.
An inspection of the Receiver-Operator Characteristics of the two methods in Figure 9.13 (a)
shows an even more drastic improvement. This is an insightful observation: it shows that
the use of the proposed robust likelihood yields less variation in the estimated similarity
between individuals across different sequences.

Finally, note that the standard deviation of our algorithm’s performance across different 
training and test illuminations is much lower than that of other methods, showing less 
dependency on the exact imaging conditions used for data acquisition. 

Representations. Both the high-pass and, even further, the Self Quotient Image
representations produced an improvement in recognition for all methods over the raw
grayscale. This is consistent with previous findings in the literature [Adi97, Ara05c, Fit02,
Wan04a].


Table 9.1: Average recognition rates (%) and their standard deviations (if applicable), given
as mean/standard deviation.

Data set    Rep.  gSIM, rob. gSIM       FaceIt     CMSM       KPA        MSM        KLD
CamFace     X     99.7/0.8   97.7/2.3   64.1/9.2   73.6/22.5  63.1/21.2  58.3/24.3  17.0/8.8
            X_H   -          -          -          85.0/12.0  83.1/14.0  82.8/14.3  35.4/14.2
            X_Q   -          -          -          87.0/11.4  87.1/9.0   83.4/8.4   42.8/16.8
ToshFace    X     99.9/0.5   96.7/5.5   81.8/9.6   79.3/18.6  49.3/25.0  46.6/28.3  23.0/15.7
            X_H   -          -          -          83.2/17.1  61.0/18.9  56.5/20.2  30.5/13.3
            X_Q   -          -          -          91.1/8.3   87.7/11.2  83.3/10.8  39.7/15.7
Face Video  X     100.0      91.9       91.9       91.9       91.9       81.8       59.1
            X_H   -          -          -          100.0      91.9       81.8       63.6
            X_Q   -          -          -          91.9       91.9       81.8       63.6


(a) (b)


Figure 9.13: (a) The Receiver-Operator Characteristic (ROC) curves of the gSIM method,
with and without the robust likelihood proposed in Section 9.3, estimated from the CamFace
and ToshFace data sets.


However, unlike in previous reports of performance evaluation of these filters, we also ask
the question of when they help and how much in each case. To quantify this, consider
“performance vectors” s_R and s_F, corresponding to respectively raw and filtered input,
each component of which is equal to the recognition rate of a method on a particular
training/test data combination. Then the vector Δs_R = s_R - \bar{s}_R contains recognition
rates relative to their average on raw input, and Δs = s_F - s_R the improvement with the
filtered representation. We then considered the angle φ between vectors Δs_R and Δs, using
both the high-pass and Self Quotient Image representations. In both cases, we found the
angle to be φ ≈ 136°.

This is an interesting result: it means that while on average both representations increase
the recognition rate, they actually worsen it in “easy” recognition conditions. The
observed phenomenon is well understood in the context of energy of intrinsic and extrinsic
image differences and noise (see [Wan05a] for a thorough discussion). Higher than average
recognition rates for raw input correspond to small changes in imaging conditions between
training and test, and hence lower energy of extrinsic variation. In this case the training and
test data sets are already normalized to have the same illumination and the two filters can
only decrease the signal-to-noise ratio, thereby worsening the recognition performance. On
the other hand, when the imaging conditions between training and test are very different,
normalization of extrinsic variation is the dominant factor and the performance is improved.

This is an important observation, as it suggests that the performance of a method that
uses either of the representations can be increased further in a very straightforward manner
by detecting the difficulty of recognition conditions. This is exploited in [Ara06f].
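The angle between the two performance-difference vectors can be computed as sketched below. The recognition rates in the usage example are made up for illustration and are not results from this evaluation, and the function names are hypothetical.

```python
import math


def angle_between_degrees(u, v):
    """Angle between two vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("the angle is undefined for a zero vector")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def filter_benefit_angle(raw_rates, filtered_rates):
    """Angle between (raw rates minus their mean) and (filtered minus raw).

    Each entry is the recognition rate for one training/test combination.
    An angle above 90 degrees means the filter helps most where raw input
    performs below average, and helps least (or hurts) where it does well.
    """
    mean_raw = sum(raw_rates) / len(raw_rates)
    delta_raw = [r - mean_raw for r in raw_rates]
    delta = [f - r for f, r in zip(filtered_rates, raw_rates)]
    return angle_between_degrees(delta_raw, delta)


if __name__ == "__main__":
    # Illustrative rates: an "easy" and a "hard" training/test combination.
    print(filter_benefit_angle([90.0, 50.0], [85.0, 80.0]))
```

In the toy example the filter lowers the rate in the easy condition and raises it in the hard one, so the two vectors point in roughly opposite directions and the angle is obtuse.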

Imaging conditions. We were interested if the evaluation results on our database support
the observation in the literature that some illumination conditions are intrinsically
more difficult for recognition than others [Sim04]. An inspection of the performance of the
evaluated methods has shown a remarkable correlation in relative performance across
illuminations, despite the very different models used for recognition. We found that relative
recognition rates across illuminations correlate on average with ρ = 0.96.
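One way such an average correlation could be obtained is sketched below. The sketch assumes that ρ is the Pearson correlation coefficient and that the average is taken over all pairs of methods; neither detail is stated above. The method names and rates in the usage example are illustrative only.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def mean_pairwise_correlation(rates_by_method):
    """Average correlation over all pairs of methods.

    rates_by_method maps a method name to its recognition rates, listed
    in the same order of illumination conditions for every method.
    """
    names = list(rates_by_method)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    if not pairs:
        raise ValueError("at least two methods are required")
    total = sum(pearson_correlation(rates_by_method[a], rates_by_method[b])
                for a, b in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    rates = {"method A": [70.0, 55.0, 90.0],
             "method B": [40.0, 20.0, 65.0],
             "method C": [85.0, 80.0, 95.0]}
    print(mean_pairwise_correlation(rates))
```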

Faces and individuals. Finally, in a similar manner as previously for different illumination
conditions, we were interested to see if certain individuals were more difficult for
recognition than others. In other words, are incorrect recognitions roughly equally
distributed across the database, or does a relatively small number of people account for most?
Our robust algorithm failed in too few cases to make a statistically significant observation,
so we instead looked at the performance of the non-robust gSIM, which failed at about an
order of magnitude greater frequency.

A histogram of recognition errors across individuals in the ToshFace data set is shown in
Figure 9.14 (a), showing that most errors were indeed repeated. It is difficult to ascertain if
this is a consequence of an inherent similarity between these individuals or a modelling
limitation of our algorithm. A subjective qualitative inspection of the individuals most
commonly confused, shown in Figure 9.14 (b), tends to suggest that the former is the
dominant cause.


(a) Repeated misclassifications (b) Examples


Figure 9.14: (a) A histogram of non-robust gSIM recognition failures across individuals in
the ToshFace data set. The majority of errors are repeated, of which two of the most common
ones are shown in (b). Visual inspection of these suggests that these individuals are indeed
inherently similar in appearance.

Computational complexity. We conclude this section with an analysis of the computational
demands of our algorithm. We focus on the online, novel sequence recognition (see
Section 9.3), as this is of most practical interest. It consists of the following stages (at this
point the reader may find it useful to refer back to the summary in Figures 9.10 and 9.11):

1. K-nearest neighbour computation for each face,

2. geodesic distance estimation for all pairs of faces,

3. genetic algorithm optimization,

4. fine reillumination of all faces, and

5. robust likelihood computation.

We use the following notation: N is the number of frames in a sequence, D the number of
face pixels, K the number of frames used in fine reillumination, N_gen the number of genetic
algorithm generations, N_chr the number of chromosomes in each generation and N_comp the
number of Gaussian components in the Generic SIM GMM.


Algorithm stage                              Asymptotic complexity
K-nearest neighbours (see Section 9.1.1)     N(ND + N log N)
geodesic distances (see Section 9.1.1)       N^3 + NK
genetic algorithm (see Section 9.1.1)        N_gen N_chr (ND + NK)
fine reillumination (see Section 9.1.1)      NK^
robust likelihood (see Section 9.3)          N N_comp D^2 + N log N

(a) Asymptotic (b) Measured

Figure 9.15: (a) Asymptotic complexity of different stages of the proposed online, novel
sequence recognition algorithm and (b) the actual measured times for our Matlab
implementation.

For each face, the K-nearest neighbour computation consists of computing its distances
to all other faces, O(DN), and ordering them to find the nearest K, O(N log N).
The estimation of geodesic distances involves initialization, O(NK), and an application of
Floyd’s algorithm, O(N^3). In a generation of the genetic algorithm, for each chromosome
we compute the similarity of all pose-signatures, O(ND), and look up geodesic distances
in all K-neighbourhoods, O(NK). Finally, robust likelihoods are computed for all faces,
O(N_comp D^2), which are then ordered, O(N log N). Treating everything but N as a constant,
the overall asymptotic complexity of the algorithm is O(N^3). A summary is presented in
Figure 9.15 (a).
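The cubic term comes from the all-pairs shortest path computation. The sketch below shows the generic construction, a K-nearest neighbour graph followed by Floyd's algorithm, on a precomputed distance matrix. It is an illustration of the standard technique rather than the implementation profiled here; the symmetric treatment of neighbour edges and the function names are assumptions.

```python
import math


def knn_graph(distances, k):
    """Keep only edges to the k nearest neighbours of every node.

    distances is a symmetric N x N matrix. Missing edges are math.inf.
    An edge is kept if either endpoint lists the other as a neighbour.
    """
    n = len(distances)
    graph = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        graph[i][i] = 0.0
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: distances[i][j])
        for j in others[:k]:
            graph[i][j] = distances[i][j]
            graph[j][i] = distances[i][j]
    return graph


def floyd_warshall(graph):
    """All-pairs shortest path lengths; three nested loops give O(N^3)."""
    n = len(graph)
    dist = [row[:] for row in graph]
    for mid in range(n):
        for i in range(n):
            via = dist[i][mid]
            if via == math.inf:
                continue
            for j in range(n):
                candidate = via + dist[mid][j]
                if candidate < dist[i][j]:
                    dist[i][j] = candidate
    return dist


if __name__ == "__main__":
    points = [0.0, 1.0, 2.0, 3.0]
    pairwise = [[abs(a - b) for b in points] for a in points]
    geodesic = floyd_warshall(knn_graph(pairwise, 1))
    print(geodesic[0][3])
```

Nodes that are not connected through any chain of neighbours keep an infinite geodesic distance, which a caller would need to handle, for instance by increasing k.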

We next profiled our implementation of the algorithm. It should be stressed that this
code was written in Matlab and consequently the running times reported are not indicative of
its actual practicability. In all experiments only the number of faces per sequence was varied:
we used N = 25, 50, 100, 200, 400 and 800 faces. Mean computation times for different
stages of the algorithm are plotted in Figure 9.15 (b). In this range of N, the measured
asymptote slopes were typically lower than predicted, which was especially noticeable for
the most demanding computations (e.g. of geodesic distances). The most likely reason for
this phenomenon is the large constants associated with Matlab’s for-loops and data allocation
routines.
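The effect of a large constant overhead on the measured slope can be illustrated with a short sketch. It fits a least-squares line to log(time) against log(N), which is one common way of estimating the growth exponent; the timings are synthetic and the function name is hypothetical.

```python
import math


def loglog_slope(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    if len(sizes) != len(times) or len(sizes) < 2:
        raise ValueError("need at least two (size, time) pairs")
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


if __name__ == "__main__":
    sizes = [25, 50, 100, 200, 400, 800]
    # Synthetic timings: a pure cubic cost, then the same cost plus a
    # constant per-call overhead that dominates for small N.
    cubic = [1e-6 * n ** 3 for n in sizes]
    with_overhead = [5.0 + t for t in cubic]
    print(loglog_slope(sizes, cubic))
    print(loglog_slope(sizes, with_overhead))
```

For the pure cubic cost the fitted slope is 3, while the added constant flattens the curve at small N and pulls the fitted slope below the asymptotic exponent, consistent with the explanation given above.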


9.5 Summary and conclusions 

In this chapter we described a novel algorithm for face recognition that uses video to achieve 
invariance to illumination, pose and user motion pattern variation. We introduced the 
concept of the Generic Shape-Illumination manifold as a model of illumination effects on 
faces and showed how it can be learnt offline from a small training corpus. This was made 
possible by the proposed “reillumination” algorithm which is used extensively both in the 
offline and online stages of the method. 

Our method was demonstrated to achieve a nearly perfect recognition on 3 databases
containing extreme variation in acquisition conditions. It was compared to and has
significantly outperformed state-of-the-art commercial software and methods in the literature.
Furthermore, an analysis of a large-scale performance evaluation (i) showed that the method
is promising for image-to-sequence matching, (ii) suggested a direction of research to improve
image filtering for illumination invariance, and (iii) confirmed that certain illuminations and
individuals are inherently particularly challenging for recognition.

There are several avenues for future work that we would like to explore. Firstly, we would 
like to make further use of offline training data, by constructing the gSIM while taking into 
account probabilities of both intra- and inter-personal differences. Additionally, we would 
like to improve the computational efficiency of the method, e.g. by representing each FMM 
by a strategically chosen set of sparse samples. Finally, we are evaluating the performance 
of image-to-sequence matching and looking into increasing its robustness, in particular to 
pose. 


Related publications 

The following publications resulted from the work presented in this chapter: 

• O. Arandjelovic and R. Cipolla. Face recognition from video using the generic
shape-illumination manifold. In Proc. IEEE European Conference on Computer Vision,
4:27-40, May 2006. [Ara06b]

• O. Arandjelovic and R. Cipolla. Achieving robust face recognition from video by
combining a weak photometric model and a learnt generic face invariant. Pattern
Recognition, 46(1):9-23, January 2013. [Ara13]


The preceding chapters concentrated on the user authentication paradigm of face
recognition. The aim was to reliably compare two video sequences of random head motion
performed by the users. In contrast, the objective of this work is to recognize all faces of a
character in the closed world of a movie or situation comedy.

This is challenging because faces in a feature-length film are relatively uncontrolled with 
a wide variability of scale, pose, illumination, and expressions, and also may be partially 
occluded. Furthermore, unlike in the previous chapters, a continuous video stream does not 
contain a face of only a single person, which increases the difficulty of data extraction. In 
this chapter recognition is performed given a small number of query faces, all of which are 
specified by the user. 

We develop and describe a recognition method based on a cascade of processing steps 
that normalize for the effects of the changing imaging environment. In particular there 
are three areas of novelty: (i) we suppress the background surrounding the face, enabling 
the maximum area of the face to be retained for recognition rather than a subset; (ii) we 
include a pose refinement step to optimize the registration between the test image and face 
exemplar; and (iii) we use robust distance to a sub-space to allow for partial occlusion and 
expression change. 

The method is applied and evaluated on several feature length films. It is demonstrated 
that high recall rates (over 92%) can be achieved whilst maintaining good precision (over 
93%). 


10.1 Introduction 

We consider face recognition for content-based multimedia retrieval: our aim is to retrieve, 
and rank by confidence, film shots based on the presence of specific actors. A query to 
the system consists of the user choosing the person of interest in one or more keyframes. 
Possible applications include: 

1. DVD browsing: Current DVD technology allows users to quickly jump to the chosen
part of a film using an on-screen index. However, the available locations are predefined.
Face recognition technology could allow the user to rapidly browse scenes by
formulating queries based on the presence of specific actors.

2. Content-based web search: Many web search engines have very popular image
search features (e.g. http://www.google.co.uk/imghp). Currently, the search is performed
based on the keywords that appear in picture filenames or in the surrounding
web page content. Face recognition can make the retrieval much more accurate by
focusing on the content of images.


Figure 10.1: Automatically detected faces in a typical frame from the feature-length film
“Groundhog day”. The background is cluttered, and pose, expression and illumination are
very variable.


As before, we proceed from the face detection stage, assuming localized faces. We use 
a local implementation of the method of Schneiderman and Kanade [Sch00] and consider
a face to be correctly detected if both eyes and the mouth are visible, see Figure 10.1. In 
a typical feature-length film, using every 10th frame, we obtain 2000-5000 face detections 
which result from a cast of 10-20 primary and secondary characters (see Section 10.3). 

Method overview. Our approach consists of computing a numerical value, a distance, 
expressing the degree of belief that two face images belong to the same person. Low distance, 
ideally zero, signifies that images are of the same person, whilst a large one signifies that 
they are of different people. 

The method involves computing a series of transformations of the original image, each 
aimed at removing the effects of a particular extrinsic imaging factor. The end result is a 
signature image of a person, which depends mainly on the person’s identity (and expression, 
see Section 10.2.5) and can be readily classified. The preprocessing stages of our algorithm 
are summarized in Figure 10.4 and Figure A.2. 

10.1.1 Previous work 

Most previous work on face recognition focuses on user authentication applications, with
few authors addressing it in a setup similar to ours. Fitzgibbon and Zisserman [Fit02]
investigated face clustering in feature films, though without explicitly using facial features for
registration. Berg et al. [Ber04] consider the problem of clustering detected frontal faces
extracted from web news pages. In a similar manner to us, affine registration with an
underlying SVM-based facial feature detector is used for face rectification. The classification
is then performed in a Kernel PCA space using combined image and contextual text-based
features. The problem we consider is more difficult in two respects: (i) the variation in
imaging conditions in films is typically greater than in newspaper photographs, and (ii) we
do not use any type of information other than visual cues (i.e. no text). The difference in the
difficulty is apparent by comparing the examples in [Ber04] with those used for evaluation in
Section 10.3. For example, in [Ber04] the face image size is restricted to be at least 86 x 86
pixels, whilst a significant number of faces we use are of lower resolution.


(a) (b) (c)


Figure 10.2: The effects of imaging conditions - illumination (a), pose (b) and expression
(c) - on the appearance of a face are dramatic and present the main difficulty to automatic
face recognition.

Everingham and Zisserman [Eve04] consider face recognition in situation comedies.
However, rather than using facial feature detection, a quasi-3D model of the head is used to
correct for varying pose. Temporal information via shot tracking is exploited for enriching the
training corpus. In contrast, we do not use any temporal information, and the use of local
features (Section 10.2.1) allows us to compare two face images in spite of partial occlusions
(Section 10.2.5).


10.2 Method details 

In the proposed framework, the first step in processing a face image is the normalization
of the subject’s pose, i.e. registration. After the face detection stage, faces are only roughly
localized and aligned; more sophisticated registration methods are needed to correct for the
appearance effects of varying pose. One way of doing this is to “lock onto” characteristic
facial points and warp images to align them. In our method, these facial points are the
locations of the mouth and the eyes.


Input: novel face image I,
       training signature image S_r.
Output: distance d(I, S_r).

1: Facial feature localization
     {x_i} = features(I)
2: Pose effects: registration by affine warping
     I_R = affine_warp(I, {x_i})
3: Background clutter: face outline detection
     I_F = I_R .* mask(I_R)
4: Illumination effects: band-pass filtering
     S = I_F * B
5: Pose effects: registration refinement
     S_f = warp_using_appearance(I_F, S_r)
6: Occlusion effects: robust distance measure
     d(I, S_r) = distance(S_r, S_f)


Figure 10.3: A summary of the main steps of the proposed algorithm. A novel, ‘input’
image I is first preprocessed to produce a signature image S_f, which is then compared with
signatures of each ‘reference’ image S_r. The intermediate results of preprocessing are also
shown in Figure 10.4.

10.2.1 Facial feature detection 

In the proposed algorithm Support Vector Machines^ (SVMs) [Bur98, Sch02] are used for
facial feature detection. A related approach was described in [Ber04]; alternative methods
include pictorial structures [Fel05], shape+appearance cascaded classifiers [Fuk98] and the
method of Cristinacce et al. [Cri04].

We represent each facial feature, i.e. the image patch surrounding it, by a feature vector.
An SVM with a set of parameters (kernel type, its bandwidth and a regularization constant)
is then trained on a part of the training data and its performance iteratively optimized on
the remainder. The final detector is evaluated by a one-time run on unseen data.

^We used the LibSVM implementation freely available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/


Figure 10.4: Each step in the proposed preprocessing cascade produces a result invariant to
a specific extrinsic imaging factor. The result is a ‘signature’ image of a person’s face.


Training 

For training we use manually localized facial features in a set of 300 randomly chosen faces
from the feature-length film “Groundhog day” and the situation comedy “Fawlty Towers”.
Examples are extracted by taking rectangular image patches centred at feature locations
(see Figures 10.5 and 10.6). We represent each patch I ∈ R^{H×W} with a feature vector
v ∈ R^{2HW} containing appearance and gradient information (we used H = 11 and W = 21
for a face image of the size 81 x 81 - the units being pixels):

    v_A(Wy + x) = I(x, y)          (10.1)

    v_G(Wy + x) = |∇I(x, y)|       (10.2)

    v = [v_A; v_G]                 (10.3)

where v_A and v_G are stacked into a single column vector.

Local information. In the proposed method, implicit local information is included for
increased robustness. This is done by complementing the image appearance vector v_A with
the greyscale intensity gradient vector v_G, as in equation (10.3).
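Equations (10.1)-(10.3) can be sketched as follows, with a patch stored as a list of H rows of W intensities. The gradient operator is not specified above, so the central differences used here (one-sided at the patch border) are an assumption of the sketch, as is the function name.

```python
import math


def patch_feature_vector(patch):
    """Stack raster-scanned intensities and gradient magnitudes.

    patch is a list of H rows, each holding W intensities. The result has
    2 * H * W entries: v_A (appearance) followed by v_G (gradient magnitude),
    both indexed as W * y + x.
    """
    height, width = len(patch), len(patch[0])
    appearance = [0.0] * (height * width)
    gradient = [0.0] * (height * width)
    for y in range(height):
        for x in range(width):
            # Assumption: central differences, one-sided at the border.
            x0, x1 = max(x - 1, 0), min(x + 1, width - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, height - 1)
            gx = (patch[y][x1] - patch[y][x0]) / (x1 - x0) if x1 > x0 else 0.0
            gy = (patch[y1][x] - patch[y0][x]) / (y1 - y0) if y1 > y0 else 0.0
            appearance[width * y + x] = float(patch[y][x])
            gradient[width * y + x] = math.hypot(gx, gy)
    return appearance + gradient


if __name__ == "__main__":
    H, W = 11, 21  # patch size quoted above for an 81 x 81 face image
    ramp = [[float(x) for x in range(W)] for _ in range(H)]
    v = patch_feature_vector(ramp)
    print(len(v))  # 2 * H * W entries
```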

Figure 10.5: Without surrounding image context, distinguishing features in low resolution
and severe illumination conditions is a hard task even for a human. Shown are a mouth
and an eye that, although easily recognized within the context of the whole image, are very
similar in isolation.


Figure 10.6: A subset of the data (1800 features were used in total) used to train the
SVM-based eye detector. Notice the low resolution and the importance of the surrounding
image context for precise localization (see Figure 10.5).


Figure 10.7: A summary of the efficient SVM-based eye detection: 1: Prior on feature
location restricts the search region. 2: Only approximately 25% of the locations are initially
classified. 3: Morphological dilation is used to approximate the dense classification result
from a sparse output. 4: The largest prior-weighted cluster is chosen as containing the feature
of interest.


Synthetic data. For robust classification, it is important that training data sets are
representative of the whole spaces that are discriminated between. In uncontrolled imaging
conditions, the appearance of facial features exhibits a lot of variation, requiring an
appropriately large training corpus. This makes the approach with manual feature extraction
impractical. In our method, a large portion of training data (1500 out of 1800 training
examples) was synthetically generated. Seeing that the surface of the face is smooth and
roughly fronto-parallel, its 3D motion produces locally affine-like effects in the image plane.
Therefore, we synthesize training examples by applying random affine perturbations to the
manually detected set (for similar approaches to generalization from a small amount of
training data see [Ara06e, Mar02, Sun05]).
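The synthesis step can be illustrated with a minimal sketch that applies small random affine perturbations to a patch. The perturbation ranges, the nearest-neighbour resampling and all function names are assumptions chosen for the example; the ranges actually used are not given above.

```python
import math
import random


def random_affine(rng, max_rotation_deg=10.0, max_scale=0.1,
                  max_shear=0.1, max_shift=1.0):
    """Draw a small random affine map as ((a, b, tx), (c, d, ty))."""
    angle = math.radians(rng.uniform(-max_rotation_deg, max_rotation_deg))
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    shear = rng.uniform(-max_shear, max_shear)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    # scale * rotation * horizontal shear
    a, b = scale * cos_a, scale * (cos_a * shear - sin_a)
    c, d = scale * sin_a, scale * (sin_a * shear + cos_a)
    tx = rng.uniform(-max_shift, max_shift)
    ty = rng.uniform(-max_shift, max_shift)
    return ((a, b, tx), (c, d, ty))


def warp_patch(patch, affine):
    """Resample a patch under an affine map about its centre.

    Uses nearest-neighbour sampling and clamps coordinates to the patch.
    """
    (a, b, tx), (c, d, ty) = affine
    height, width = len(patch), len(patch[0])
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            u, v = x - cx, y - cy
            sx = a * u + b * v + tx + cx
            sy = c * u + d * v + ty + cy
            ix = min(max(int(round(sx)), 0), width - 1)
            iy = min(max(int(round(sy)), 0), height - 1)
            row.append(patch[iy][ix])
        out.append(row)
    return out


def synthesize_examples(patch, count, seed=0):
    """Generate `count` randomly perturbed copies of one labelled patch."""
    rng = random.Random(seed)
    return [warp_patch(patch, random_affine(rng)) for _ in range(count)]
```

Generating five perturbed copies of each of 300 manually labelled patches, for example, would yield 1500 synthetic examples, which matches the proportions quoted above.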

SVM-based feature detector 

SVMs only provide a classification decision for individual feature vectors, but no associated
probabilistic information. Therefore, performing classification on all image patches produces
as a result a binary image (a feature is either present or not in a particular location) from
which only a single feature location is to be selected.

Our method is based on the observation that due to the robustness to noise of SVMs,
the binary image output consists of connected components of positive classifications (we will
refer to these as clusters), see Figure 10.7. We use a prior on feature locations to focus on the
cluster of interest. Priors corresponding to the three features are assumed to be independent
and Gaussian (2D, with full covariance matrices) and are learnt from the training corpus
of 300 manually localized features described in Section 10.2.1. We then consider the total
‘evidence’ for a feature within each cluster:

    e(S) = ∫_{x ∈ S} P(x) dx    (10.4)

where S is a cluster and P(x) the Gaussian prior on the facial feature location. An unbiased
feature location estimate with σ ≈ 1.5 pixels was obtained by choosing the mean of the
cluster with largest evidence as the final feature location. Intermediate results of the method
are shown in Figure 10.7, while Figure 10.8 shows examples of detected features.


Figure 10.8: High accuracy in automatic detection of facial features is achieved in spite of
wide variation in facial expression, pose, illumination and the presence of facial wear (e.g.
glasses and makeup).
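The cluster selection in (10.4) can be sketched as follows. The sketch approximates the integral by a sum of the prior over the pixels of each cluster and uses 4-connectivity to define clusters; both choices, and the function names, are assumptions of the example.

```python
import math


def gaussian_prior_2d(point, mean, cov):
    """Density of a 2D Gaussian with full covariance at point = (x, y)."""
    (sxx, sxy), (syx, syy) = cov
    det = sxx * syy - sxy * syx
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    mahalanobis = (syy * dx * dx - (sxy + syx) * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * mahalanobis) / (2.0 * math.pi * math.sqrt(det))


def connected_components(binary):
    """4-connected clusters of positive pixels, each a list of (x, y)."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    clusters = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            stack, cluster = [(x, y)], []
            seen[y][x] = True
            while stack:
                cx, cy = stack.pop()
                cluster.append((cx, cy))
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((nx, ny))
            clusters.append(cluster)
    return clusters


def locate_feature(binary, prior_mean, prior_cov):
    """Mean of the cluster with the largest summed prior, or None."""
    best = None
    for cluster in connected_components(binary):
        evidence = sum(gaussian_prior_2d(p, prior_mean, prior_cov)
                       for p in cluster)
        if best is None or evidence > best[0]:
            mean_x = sum(p[0] for p in cluster) / len(cluster)
            mean_y = sum(p[1] for p in cluster) / len(cluster)
            best = (evidence, (mean_x, mean_y))
    return None if best is None else best[1]
```

Because the evidence is accumulated over pixels, a large cluster far from the expected location can still lose to a small cluster near the prior mean, which is the behaviour the prior weighting is meant to achieve.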

10.2.2 Registration 

In the proposed method dense point correspondences are implicitly or explicitly used in sev¬ 
eral stages: for background clutter removal, partial occlusion detection and signature image 
comparison (Section 10.2.3-10.2.5). To this end, images of faces are affine warped to have 
salient facial features aligned with their mean, canonical locations. The six transformation 
parameters are uniquely determined from three pairs of point correspondences - between 
detected facial features (the eyes and the mouth) and this canonical frame. In contrast 
to global appearance-based methods (e.g. [Bla99, Edw98a]) our approach is more robust to 
partial occlusion. It is summarized in Figure 10.9 with typical results shown in Figure 10.10. 

Input:  canonical facial feature locations x_can, 
        face image I, 
        facial feature locations x_in. 
Output: registered image I_reg. 

1: Estimate the affine warp matrix 
       A = affine_from_correspondences(x_can, x_in) 
2: Compute eigenvalues of A 
       {λ_1, λ_2} = eig(A) 
3: Impose prior on shear and rescaling by A 
       if (|A| ∈ [0.9, 1.1] ∧ λ_1/λ_2 ∈ [0.6, 1.3]) then 
4:         Warp the image 
               I_reg = affine_warp(I; A) 
       else 
5:         Face detector false +ve 

Figure 10.9: A summary of the proposed facial feature-based registration of faces and removal 
of face detector false positives. 

Figure 10.10: Original (top) and corresponding registered images (bottom). The eyes and 
the mouth in all registered images are at the same, canonical locations. The effects of affine 
transformations are significant. 
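As a rough illustration of the first three steps of Figure 10.9, the sketch below solves for the six affine parameters from three point pairs using Cramer's rule and then applies the plausibility test. Several details are assumptions made for the example only: |A| is read as the determinant of the 2x2 linear part, the eigenvalue ratio is taken between eigenvalue magnitudes (so that complex eigenvalues of rotation-like warps are handled), and the direction of the mapping (detected features to canonical frame) is a convention chosen here.

```python
import cmath

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def affine_from_correspondences(src, dst):
    """Return (L, t) such that dst_i = L * src_i + t for three point pairs."""
    base = [[x, y, 1.0] for x, y in src]
    d = det3(base)
    if abs(d) < 1e-12:
        raise ValueError("source points are collinear")
    rows = []
    for k in range(2):            # one row of the warp per output coordinate
        row = []
        for col in range(3):      # Cramer's rule: replace one column by the targets
            m = [r[:] for r in base]
            for i in range(3):
                m[i][col] = dst[i][k]
            row.append(det3(m) / d)
        rows.append(row)
    L = [[rows[0][0], rows[0][1]], [rows[1][0], rows[1][1]]]
    t = [rows[0][2], rows[1][2]]
    return L, t

def is_plausible_warp(L, det_range=(0.9, 1.1), ratio_range=(0.6, 1.3)):
    """Prior on rescaling (determinant) and shear (eigenvalue ratio) of the linear part."""
    det = L[0][0] * L[1][1] - L[0][1] * L[1][0]
    tr = L[0][0] + L[1][1]
    disc = cmath.sqrt(tr * tr - 4.0 * det)
    l1, l2 = (tr + disc) / 2.0, (tr - disc) / 2.0   # eigenvalues of a 2x2 matrix
    if abs(l2) < 1e-12:
        return False
    ratio = abs(l1) / abs(l2)
    return det_range[0] <= det <= det_range[1] and ratio_range[0] <= ratio <= ratio_range[1]
```

A detection whose warp fails the test would be discarded as a face detector false positive instead of being warped.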


10.2.3 Background removal 

The bounding box of a face, supplied by the face detector, typically contains significant 
background clutter and affine registration boundary artefacts, see Figure 10.10. To realize 
a reliable comparison of two faces, segmentation to foreground (i.e. face) and background 
regions has to be performed. We show that the face outline can be robustly detected by 
combining a prior on the face shape, learnt offline, and a set of measurements of intensity 
discontinuity in an image of a face. The proposed method requires only grey level infor¬ 
mation, performing equally well for colour and greyscale input, unlike previous approaches 
which typically use skin colour for segmentation (e.g. [Ara05b]). 

In detecting the face outline, we only consider points confined to a discrete mesh 
corresponding to angles equally spaced at Δα and radii at Δr, see Figure 10.11 (a); in our 
implementation we use Δα = 2π/100 and Δr = 1. At each mesh point we measure the 
image intensity gradient in the radial direction - if its magnitude is locally maximal and 
greater than a threshold t, we assign it a constant high probability, and a constant low 
probability otherwise, see Figure 10.11 (a,b). Let m_i be a vector of probabilities corresponding 
to discrete radius values at angle α_i = iΔα, and r_i the boundary location at the same angle. 

Figure 10.11: Markov chain observations: (a) A discrete mesh in radial coordinates (only 
10% of the points are shown for clarity) to which the boundary is confined. Also shown 
is a single measurement of image intensity in the radial direction and the detected high 
probability points. The plot of image intensity along this direction is shown in (b) along 
with the gradient magnitude used to select the high probability locations. 


We seek the maximum a posteriori estimate of the boundary radii: 


    \{r_i\} = \arg\max_{\{r_i\}} P(r_1, \ldots, r_N | m_1, \ldots, m_N) =    (10.5) 

              \arg\max_{\{r_i\}} P(m_1, \ldots, m_N | r_1, \ldots, r_N) P(r_1, \ldots, r_N),  where N = 2\pi/\Delta\alpha.    (10.6) 

We make the Naive Bayes assumption for the first term in equation (10.6), whereas, 
exploiting the observation that surfaces of faces are mostly smooth, for the second term we 
assume the boundary radii to form a first-order Markov chain. Formally: 

    P(m_1, \ldots, m_N | r_1, \ldots, r_N) = \prod_{i=1}^{N} P(m_i | r_i) = \prod_{i=1}^{N} m_i(r_i)    (10.7) 

    P(r_1, \ldots, r_N) = P(r_1) \prod_{i=2}^{N} P(r_i | r_{i-1})    (10.8) 

In our method model parameters (priors and likelihoods) are learnt from 500 manually 
delineated face outlines. The application of the model by maximizing the expression in (10.5) is 
efficiently realized using dynamic programming, i.e. the well-known Viterbi algorithm [Gri92]. 
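The dynamic programming step can be sketched as follows. This is a generic Viterbi decoder written for illustration, not the implementation used in this work: `likelihoods[i][r]` plays the role of m_i(r), `prior[r]` that of P(r_1) and `transition[q][r]` that of P(r_i = r | r_{i-1} = q), all of which are assumed to be supplied as plain lists. The chain is treated as open, i.e. no constraint ties the last angle back to the first.

```python
import math

def viterbi_boundary(likelihoods, prior, transition):
    """Return the MAP sequence of radius indices, one per angle."""
    def log(p):
        return math.log(p) if p > 0.0 else float("-inf")

    n_radii = len(prior)
    # Log-score of the best path ending at each radius for the first angle.
    score = [log(prior[r]) + log(likelihoods[0][r]) for r in range(n_radii)]
    back = []
    for obs in likelihoods[1:]:
        new_score, pointers = [], []
        for r in range(n_radii):
            # Best predecessor radius q for the current radius r.
            q_best = max(range(n_radii), key=lambda q: score[q] + log(transition[q][r]))
            new_score.append(score[q_best] + log(transition[q_best][r]) + log(obs[r]))
            pointers.append(q_best)
        score = new_score
        back.append(pointers)
    # Backtrack from the best final radius.
    r = max(range(n_radii), key=lambda k: score[k])
    path = [r]
    for pointers in reversed(back):
        r = pointers[r]
        path.append(r)
    path.reverse()
    return path
```

With N angles and R candidate radii the cost is O(N R^2), compared with R^N for exhaustive search over boundaries.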


Feathering. The described method of segmentation of face images to foreground and 
background produces as a result a binary mask image M. As well as masking the 
corresponding registered face image I_r (see Figure 10.12), we smoothly suppress image 
information around the boundary to achieve robustness to small errors in its localization. This is 
often referred to as feathering: 

    M_F = M * \exp\{-\ldots\}    (10.9) 

    I_F(x, y) = I_r(x, y) M_F(x, y)    (10.10) 

Examples of segmented and feathered faces are shown in Figure 10.13. 

Figure 10.12: Original image, image with detected face outline, and the resulting image with 
the background masked. 

Figure 10.13: Original images of detected and affine-registered faces and the result of the 
proposed segmentation algorithm. Subtle variations of the face outline caused by different 
poses and head shapes are handled with high precision. 


10.2.4 Compensating for changes in illumination 

The last step in processing of a face image to produce its signature is the removal of 
illumination effects. As the most significant modes of illumination changes are rather coarse 
- ambient light varies in intensity, while the dominant illumination source is either frontal, 
illuminating from the left, right, top or bottom (seldom) - and noting that these produce 
mostly slowly varying, low spatial frequency variations [Fit02] (also see Section 2.3.2), we 
normalize for their effects by band-pass filtering, see Figure 10.4: 

    S = I_F * G_{\sigma=0.5} - I_F * G_{\sigma=8}    (10.11) 

This defines the signature image S. 
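The band-pass filter of equation (10.11) is a difference of two Gaussian-smoothed copies of the image. A minimal NumPy sketch is given below; the kernel truncation at three standard deviations and the edge-replicating border handling are assumptions made for the example, as the text does not specify them.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(np.ceil(3.0 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(image, sigma):
    """Separable Gaussian smoothing with edge replication at the borders."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(np.asarray(image, dtype=float), r, mode="edge")
    # Filter along rows, then along columns; "valid" undoes the padding.
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def signature_image(face, sigma_low=0.5, sigma_high=8.0):
    """Band-pass filtered image: fine-scale blur minus coarse-scale blur."""
    return gaussian_blur(face, sigma_low) - gaussian_blur(face, sigma_high)
```

Because both kernels sum to one, a uniform change in ambient brightness cancels in the difference, which is the intended invariance.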

10.2.5 Comparing signature images 

In Sections 10.2.1-10.2.4 a cascade of transformations applied to face images was described, 
producing a signature image insensitive to illumination, pose and background clutter. We 
now show how the accuracy of facial feature alignment and the robustness to partial occlusion 
can be increased further when two signature images are compared. 

Improving registration 

In the registration method proposed in Section 10.2.2, the optimal affine warp parameters 
were estimated from three point correspondences in 2D. Therefore, the 6 degrees of freedom 
of the affine transformation were uniquely determined, making the estimate sensitive to 
facial feature localization errors. To increase the accuracy of registration, we propose a 
dense, appearance-based affine correction to the already computed feature correspondence- 
based registration. 

In our algorithm, the corresponding characteristic regions of two faces, see Figure 10.14 (a), 
are perturbed by small translations to find the optimal residual shift (i.e. that which gives 
the highest normalized cross-correlation score between the two overlapping regions). These 
new point correspondences now overdetermine the residual affine transformation (which we 
estimate in the least-squares sense, i.e. by minimizing the L2 norm of the error) that is 
applied to the image. Some results 
are shown in Figure 10.14. 

Distance 

Single query image. Given two signature images in precise correspondence (see above), 
S_1 and S_2, we compute the following distance between them: 

    d_S(S_1, S_2) = \sum_x \sum_y h(S_1(x, y) - S_2(x, y))    (10.12) 

where h(ΔS) = (ΔS)^2 if the probability of occlusion at (x, y) is low and a constant value 
k otherwise. This is effectively the L2 norm with added outlier (e.g. occlusion) robustness, 
similar to [Bla98]. We now describe how this threshold is determined. 

Figure 10.14: Pose refinement summary: (a) Salient regions of the face used for 
appearance-based computation of the residual affine registration. (b)(c) Images aligned using feature 
correspondences alone. (d) The salient regions shown in (a) are used to refine the pose of 
(b) so that it is more closely aligned with (c). The residual rotation between (b) and (c) is 
removed. This correction can be seen clearly in the difference images: (e) is |S_c - S_b| and 
(f) is |S_c - S_d|. 


Partial occlusions. Occlusions of imaged faces in films are common. Whilst some 
research has addressed detecting and removing specific artefacts only, such as glasses [Jin00], 
here we give an alternative non-parametric approach, and use a simple appearance-based 
statistical method for occlusion detection. Given that the error contribution at (x, y) is 
ε = ΔS(x, y), we detect occlusion if the probability P_S(ε) that ε is due to inter- or 
intra-personal differences is less than 0.05. Pixels are classified as occluded or not on an 
independent basis. P_S(ε) is learnt in a non-parametric fashion from a face corpus with no 
occlusion. 
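The robust distance of equation (10.12) combined with the per-pixel occlusion test can be sketched as below. The histogram-based estimate of P_S is only one possible non-parametric choice and is an assumption of this example, as are the bin width and the function names; the value of the constant k is left as a parameter.

```python
import math

def histogram_density(samples, bin_width):
    """Non-parametric estimate of P_S: the fraction of training differences in each bin."""
    counts = {}
    for e in samples:
        b = math.floor(e / bin_width)
        counts[b] = counts.get(b, 0) + 1
    n = float(len(samples))
    return lambda diff: counts.get(math.floor(diff / bin_width), 0) / n

def robust_distance(s1, s2, prob_of_error, k, p_min=0.05):
    """Sum of squared differences, with a constant penalty k for likely occluded pixels."""
    total = 0.0
    for row1, row2 in zip(s1, s2):
        for a, b in zip(row1, row2):
            diff = a - b
            if prob_of_error(diff) < p_min:
                total += k            # pixel treated as occluded
            else:
                total += diff * diff  # ordinary squared error
    return total
```

Capping the contribution of improbable differences at k prevents a few occluded pixels from dominating the distance between two images of the same person.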

The proposed approach achieved a reduction of 33% in the expected within-class signa¬ 
ture image distance, while the effect on between-class distances was found to be statistically 
insignificant. 


Multiple query images. The distance introduced in equation (10.12) gives the 
confidence measure that two signature images correspond to the same person. Often, however, 
more than a single image of a person is available as a query: these may be supplied by the 
user or can be automatically added to the query corpus as the highest ranking matches of 
a single image-based retrieval. In either case we want to be able to quantify the confidence 
that the person in the novel image is the same as in the query set. 

Seeing that the processing stages described so far greatly normalize for the effects of 
changing pose, illumination and background clutter, the dominant mode of variation across 
a query corpus of signature images {Si} can be expected to be due to facial expression. We 
assume that the corresponding manifold of expression is linear, making the problem that 
of point-to-subspace matching [Bla98]. Given a novel signature image S_N we compute a 
robust distance: 

    d_G(\{S_i\}, S_N) = d_S(F F^T S_N, S_N)    (10.13) 

where F is the orthonormal basis matrix corresponding to the linear subspace that explains 95% 
of energy of variation within the set {S_i}. 
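A minimal NumPy sketch of the point-to-subspace distance in equation (10.13) follows. Two simplifications are assumed for brevity: the basis F is obtained from an SVD of the uncentred data matrix, so that the projection is literally F F^T S_N, and d_S is replaced by the plain squared L2 distance without the occlusion handling described above.

```python
import numpy as np

def subspace_basis(signatures, energy=0.95):
    """Orthonormal basis F of the smallest subspace retaining the requested energy."""
    X = np.stack([np.ravel(s) for s in signatures], axis=1).astype(float)
    U, sv, _ = np.linalg.svd(X, full_matrices=False)
    cumulative = np.cumsum(sv ** 2) / np.sum(sv ** 2)
    # Number of leading components needed to reach the energy fraction.
    n_keep = int(np.searchsorted(cumulative, energy) + 1)
    return U[:, :n_keep]

def set_distance(signatures, novel, energy=0.95):
    """Squared distance between a novel signature and its projection onto the query subspace."""
    F = subspace_basis(signatures, energy)
    s = np.ravel(novel).astype(float)
    reconstruction = F @ (F.T @ s)
    return float(np.sum((reconstruction - s) ** 2))
```

A novel image that is a linear combination of the query images has zero distance, which is how expression changes spanned by the query set are discounted.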


10.3 Empirical evaluation 

The proposed algorithm was evaluated on automatically detected faces from the situation 
comedy “Fawlty Towers” (“A touch of class” episode), and feature-length films “Groundhog 
Day” and “Pretty Woman”. Detection was performed on every 10th frame, producing 
respectively 330, 5500, and 3000 detected faces (including incorrect detections). Face images 
(frame regions within bounding boxes determined by the face detector) were automatically 
resized to 80 x 80 pixels, see Figure 10.17 (a). 

10.3.1 Evaluation methodology 

Empirical evaluation consisted of querying the algorithm with each image in turn (or image 
set for multiple query images) and ranking the data in order of similarity to it. Two ways of 
assessing the results were employed - using Receiver Operator Characteristics (ROC) and 
the rank ordering score introduced in Section 2.4. 

10.3.2 Results 

Typical Receiver Operator Characteristic curves obtained with the proposed method are 
shown in Figure 10.15 (a, b). These show that excellent results are obtained using as 
little as 1-2 query images, typically correctly recalling 92% of the faces of the query person 
with only 7% of false retrievals. As expected, more query images produced better retrieval 
accuracy, also illustrated in Figure 10.15 (e, f). Note that as the number of query images is 
increased, not only is the ranking better on average but also more robust, as demonstrated 
by a decreased standard deviation of rank order scores. This is very important in practice, 
as it implies that less care needs to be taken by the user in the choice of query images. 
For the case of multiple query images, we compared the proposed subspace-based matching 
with the k-nearest neighbours approach, which was found to consistently produce worse 
results. The improvement of recognition with each stage of the proposed algorithm is shown 
in Figure 10.16. 

^Available at http://www.robots.ox.ac.uk/~vgg/data/ 

Example retrievals are shown in Figures 10.17-10.19. Only a single incorrect face is 
retrieved in the first 50, and this is with a low matching confidence (i.e. ranked amongst 
the last in the retrieved set). Notice the robustness of our method to pose, expression, 
illumination and background clutter. 

10.4 Summary and conclusions 

In this chapter we introduced a content-based film-shot retrieval system driven by a novel 
face recognition algorithm. The proposed approach of systematically removing particular 
imaging distortions - pose, background clutter, illumination and partial occlusion - has been 
demonstrated to consistently achieve high recall and precision rates on several well-known 
feature-length films and situation comedies. 


Related publications 

The following publications resulted from the work presented in this chapter: 

• O. Arandjelovic and A. Zisserman. Automatic face recognition for film character 
retrieval in feature-length films. In Proc. IEEE Conference on Computer Vision and 
Pattern Recognition, pages 860-867, June 2005. [Ara05c] 

• O. Arandjelovic and A. Zisserman. Interactive Video: Algorithms and Technologies, 
chapter On Film Character Retrieval in Feature-Length Films, pages 89-103. 
Springer-Verlag, 2006. ISBN 978-3-540-33214-5. [Ara06i] 

Figure 10.15: (a, b) ROC curves for the retrieval of Basil (c) and Sybil (d) in “Fawlty 
Towers”. The corresponding rank ordering scores across 35 retrievals are shown in (e) and 
(f), sorted for the ease of visualization. 

Figure 10.16: The average rank ordering score of the baseline algorithm and its improvement 
as each of the proposed processing stages is added. The improvement is demonstrated both in 
the increase of the average score, and also in the decrease of its standard deviation averaged 
over different queries. Finally, note that the averages are brought down by a few very difficult 
queries, which is illustrated well in Figure 10.15 (e,f). 

Figure 10.17: (a) The “Pretty Woman” data set - every 50th detected face is shown for 
compactness. Typical retrieval result is shown in (b) - query images are outlined by a solid 
line, the incorrectly retrieved face by a dashed line. The performance of our algorithm is 
very good in spite of the small number of query images used and the extremely difficult data 
set - this character frequently changes wigs, makeup and facial expressions. 

Figure 10.18: (a) The “Fawlty Towers” data set - every 30th detected face is shown for 
compactness. Typical retrieval result is shown in (b) - query images are outlined. There are 
no incorrectly retrieved faces in the top 50. 

Figure 10.19: (a) The “Groundhog Day” data set - every 30th detected face is shown for 
compactness. Typical retrieval results are shown in (b) and (c) - query images are outlined. 
There are no incorrectly retrieved faces in the top 50. 

In this chapter we continue looking at faces in feature-length films. We consider the 
most difficult recognition setting of all - fully automatic (i.e. without any dataset-specific 
training information) listing of the individuals present in a video. In other words, our goal is 
to automatically determine the cast of a feature-length film without any user intervention. 

The main contribution of this chapter is an algorithm for clustering over face appearance 
manifolds themselves. Specifically: (i) we develop a novel algorithm for exploiting coherence 
of dissimilarities between manifolds, (ii) we show how to estimate the optimal dataset-specific 
discriminant manifold starting from a generic one, and (iii) we describe a fully automatic, 
practical system based on the proposed algorithm. 

We present the results of preliminary empirical evaluation which show a vast improve¬ 
ment over traditional pattern clustering methods in the literature. 


11.1 Introduction 

The problem that we address in this chapter is that of automatically determining the cast 
of a feature-length film. This is a far more difficult problem than that of character retrieval. 
However, it is also more appealing from a practical standpoint: the method we will describe 
can be used to pre-compute and compactly store identity information for an entire film, 
rendering any retrieval equivalent to indexing a look-up table and, consequently, extremely 
fast. 

The first idea of this chapter concerns the observation that some people are inherently 
more similar looking to each other than others. As an example from our data set, in 
certain imaging conditions Jim Hacker may be difficult to distinguish from his secretary, Sir 
Humphrey (see respectively Figures 11.1 and 11.10 in Section 11.2.3). However, regardless 
of the imaging setup he is unlikely to be mistaken, say, for his wife, see Figure 11.5 in 
Section 11.2.1. The question is then how to automatically extract and represent the structure 
of these inter-personal similarities from unlabelled sets of video sequences. We show that 
this can be done by working in what we term the manifold space - a vector space in which 
each point is an appearance manifold. 

The second major contribution of this chapter is a method for unsupervised extraction 
of inter-class data for discriminative learning on an unlabelled set of video sequences. In 
spirit, this approach is similar to the work of Lee and Kriegman [Lee05] in which a generic 
appearance manifold is progressively updated with new data to converge to a person-specific 
one. In contrast, we start from a generic discriminative manifold and converge to a data- 
specific one, automatically collecting within-class data. 

An overview of the entire system is shown in Figure 11.2. 

Figure 11.1: The appearance of faces in films exhibits great variability depending on the 
extrinsic imaging conditions. Shown are the most common sources of intra-personal 
appearance variations (all faces are from the same episode of the situation comedy “Yes, 
Minister”). 


11.2 Method details 

In this section we describe each of the steps in the algorithmic cascade of the proposed 
method: (i) automatic data acquisition and preprocessing, (ii) unsupervised discriminative 
learning and (iii) clustering over appearance manifolds. 

11.2.1 Automatic data acquisition 

Our cast clustering algorithm is based on pair-wise comparisons of face manifolds [Ara05b, 
Lee05, Mog02] that correspond to sequences of moving faces. Hence, the first stage of the 
proposed method is automatic acquisition of face data from a continuous feature-length film. 
We (i) temporally segment the video into shots, (ii) detect faces in each and, finally, (iii) 
collect detections through time by tracking in the (X, Y, scale) space. 

Shot transition detection. A number of reliable methods for shot transition detection 
have been proposed in the literature [Ham95, Ots93, Zab95, Zha93]. We used the Edge 
Change Ratio (ECR) [Zab95] algorithm as it is able in a unified manner to detect all 3 
standard types of shot transitions: (i) hard cuts, (ii) fades and (iii) dissolves. The ECR is 
defined as: 


    ECR_n = \max\left( X_n^{in} / \sigma_n, \; X_{n-1}^{out} / \sigma_{n-1} \right)    (11.1) 

where σ_n is the number of edge pixels computed using the Canny edge detector [Can86], 
and X_n^{in} and X_n^{out} the number of entering and exiting edge pixels in frame n. Shot changes 
are then recognized by considering local peaks of ECR_n exceeding a threshold, see [Lie98, 
Zab95] for details and Figure 11.3 for an example. 

Input:  film frames {f_t}, 
        generic discrimination subspace B_g. 
Output: cast classes C. 

1: Data extraction - face manifolds 
       T = get_manifolds({f_t}) 
2: Synthetically repopulate manifolds 
       T = repopulate(T) 
3: Adaptive discriminative learning - distance matrix 
       D_s = distance(T, B_g) 
4: Manifold space 
       M = MDS(D_s) 
5: Initial classes 
       C = classes(D_s) 
6: Anisotropic boundaries in manifold space 
       for C_i, C_j ∈ C 
7:         PPCA models 
               = PPCA(C_i, C_j, M) 
8:         Merge clusters using Weighted Description Length 
               ΔDL(i, j) < threshold ? merge(i, j, C) 

Figure 11.2: A summary of the main steps of the proposed algorithm. 
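To make the Edge Change Ratio concrete, the sketch below computes it from two edge maps and picks thresholded local peaks. It assumes that edge detection has already been performed and that each edge map is given as a set of (x, y) pixel coordinates; the dilation radius used to decide whether an edge pixel has a counterpart in the other frame, and the handling of empty edge maps, are assumptions of this example.

```python
def dilate(pixels, radius):
    """All pixels within a square neighbourhood of the given edge pixels."""
    return {(x + dx, y + dy) for x, y in pixels
            for dx in range(-radius, radius + 1)
            for dy in range(-radius, radius + 1)}

def edge_change_ratio(prev_edges, curr_edges, radius=1):
    """ECR between consecutive frames, from their sets of edge pixels."""
    if not prev_edges or not curr_edges:
        # Assumed convention: edges appearing from or vanishing into nothing is a full change.
        return 1.0 if (prev_edges or curr_edges) else 0.0
    # Entering: edge pixels of the current frame with no nearby edge in the previous frame.
    entering = len(curr_edges - dilate(prev_edges, radius))
    # Exiting: edge pixels of the previous frame with no nearby edge in the current frame.
    exiting = len(prev_edges - dilate(curr_edges, radius))
    return max(entering / len(curr_edges), exiting / len(prev_edges))

def shot_boundaries(ecr, threshold):
    """Indices of local peaks of the ECR sequence that exceed the threshold."""
    return [n for n in range(1, len(ecr) - 1)
            if ecr[n] > threshold and ecr[n] >= ecr[n - 1] and ecr[n] >= ecr[n + 1]]
```

Small camera or object motion moves edges by a pixel or so and is absorbed by the dilation, whereas a cut replaces most edges at once and drives the ratio towards one.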

Face tracking through shots. We detect faces in cluttered scenes on an independent, 
frame-by-frame basis with the Viola-Jones [Vio04] cascaded algorithm^. For each detected 
face, the detector provides a triplet of the X and Y locations of its centre and scale s. 
In the proposed method, face detections are connected by tracks using a simple tracking 
algorithm in the 3D space x = (X, Y, s). We employ a form of the Kalman filter in which 
observations are deemed completely reliable (i.e. noise-free) and the dynamic model is that 
of zero mean velocity, E[dx/dt] = 0, with a diagonal noise covariance matrix. A typical tracking 
result is illustrated in Figure 11.4 with a single face track obtained shown in Figure 11.5. 
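Under the stated model the filter reduces to a simple form: with noise-free observations the state estimate equals the last detection, and with zero mean velocity the prediction for a later frame is that same detection, with uncertainty growing with the elapsed time. The sketch below uses this to link detections into tracks. The greedy nearest-detection association, the gate value and the per-coordinate process variances are illustrative assumptions and are not taken from the text.

```python
def link_detections(frames, process_var=(25.0, 25.0, 0.04), gate=9.0):
    """Link per-frame detections (X, Y, s) into tracks.

    frames: list over time of lists of detections.
    Returns a list of tracks, each a list of (frame_index, detection).
    """
    tracks = []
    for t, detections in enumerate(frames):
        unused = list(detections)
        for track in tracks:
            last_t, last = track[-1]
            dt = t - last_t  # frames since the track was last observed
            best, best_d2 = None, gate
            for det in unused:
                # Squared Mahalanobis distance to the prediction (the last detection),
                # with diagonal covariance growing linearly with elapsed time.
                d2 = sum((a - b) ** 2 / (q * dt) for a, b, q in zip(det, last, process_var))
                if d2 < best_d2:
                    best, best_d2 = det, d2
            if best is not None:
                track.append((t, best))
                unused.remove(best)
        # Detections not claimed by any existing track start new tracks.
        tracks.extend([[(t, det)] for det in unused])
    return tracks
```

In practice tracks would also be terminated at shot boundaries and after a number of frames without a matching detection; this is omitted here for brevity.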

^We used the freely available code, part of the Intel® OpenCV library. 

Figure 11.3: The unsmoothed Edge Change Ratio for a 20s segment from the situation 
comedy “Yes, Minister”. 

Figure 11.4: The X coordinate of detected faces (red dots) through time in a single shot and 
the resulting tracks connecting them (blue lines) as determined by our algorithm. 

Figure 11.5: A typical face track obtained using our algorithm. Shown are (a) the original 
images as detected by the face detector (rescaled to the uniform scale of 50 x 50 pixels) and 
(b) the same images as points in the 3D principal component space with temporal connections. 


11.2.2 Appearance manifold discrimination 

Having collected face tracks from a film, we turn to the problem of clustering these sub¬ 
sequences by corresponding identity. Due to the smoothness of faces, each track corresponds 
to an appearance manifold [Ara05b, Lee05, Mog02], as illustrated in Figure 11.5. We want 
to compare these manifolds and use the structure of the variation of dissimilarity between 
them to deduce which ones describe the same person. 


Data preprocessing. As in the previous chapter, as the first step in the comparison of 
two appearance manifolds we employ simple preprocessing on a frame-by-frame basis that 
normalizes for the majority of illumination effects and suppresses the background. If X is 
an image of a face, in the usual form of a raster-ordered pixel vector, we first normalize for 
the effects of illumination using a high-pass filter (previously used in [Ara05c, Fit02]) scaled 
by local image intensity: 

    X_L = X * G_{\sigma=1.5}    (11.2) 

    X_H = X - X_L    (11.3) 

    X_I(x, y) = X_H(x, y) / X_L(x, y).    (11.4) 

This is similar to the Self-Quotient Image of Wang et al. [Wan04a]. The purpose of local 
scaling is to equalize edge strengths in shadowed (weak) and well-illuminated (strong) regions 
of the face. 

Background is suppressed with a weighting mask M_F, produced by feathering (similar 
to [Ara05c]) the mean face outline M, as shown in Figure 11.7: 

    (11.5) 

    X_F(x, y) = X_I(x, y) M_F(x, y).    (11.6) 

Input:  manifolds T = {T_i}, 
        generic discrimination subspace B_g. 
Output: distance matrix D_s. 

1: Distance matrix using generic discrimination 
       D_g = distance(T, B_g) 
2: Provisional classes 
       C_T = classes(D_g) 
3: Data-specific discrimination space 
       B_s = constraint_space(C_T) 
4: Mixed discrimination space 
       B_c = combine_eigenspaces(B_s, B_g) 
5: Distance matrix using data-specific discrimination 
       D_s = distance(T, B_c) 

Figure 11.6: From generic to data-specific discrimination - algorithm summary. 

Figure 11.7: (a) The mask used to suppress cluttered background in images of 
automatically detected faces, and (b) an example of a detected, unprocessed face and the result of 
illumination normalization and background suppression. 


Synthetic data augmentation. Many of the collected face tracks in films are short and 
contain little pose variation. For this reason, we automatically enrich the training data 
corpus by stochastically repopulating geodesic neighbourhoods of randomly drawn manifold 
samples. This is the same approach we used in Chapter 4 so here we only briefly summarize 
it for continuity. 

Under the assumption that the face to image space embedding function is smooth, 
geodesically close images correspond to small changes in imaging parameters (e.g. yaw or 
pitch). Hence, using the first-order Taylor approximation of the effects of a projective 
camera, the face motion manifold is locally topologically similar to the affine warp manifold. 
The proposed algorithm then consists of random draws of a face image x from the data, 
stochastic perturbation of x by a set of affine warps {Aj} and finally, the augmentation of 
data by the warped images. 


Comparing normalized appearance manifolds 

For pair-wise comparisons of manifolds we employ the Constraint Mutual Subspace Method 
(CMSM) [Fuk03], based on principal angles between subspaces [Hot36, Oja83]. This choice 
is motivated by: (i) the good performance of CMSM reported in the literature [Ara05b, Fuk03], 
(ii) its computational efficiency [Bjö73] and compact data representation, and (iii) its ability 
to extract the most similar modes of variation between two subspaces, see Chapter 5 for 
more detail. 

As in [Fuk03], we represent each appearance manifold by a minimal linear subspace 
it is embedded in - estimated using Probabilistic PCA [Tip99b]. The similarity of two 
such subspaces is then computed as the mean of their first 3 canonical correlations, after 
the projection onto the constraint subspace - a linear manifold that attempts to maximize 
the separation (in terms of canonical correlations) between different class subspaces, see 
Figure 11.8. 

Figure 11.8: A visualization of the basis of the linear constraint subspace, the most 
descriptive linear subspace (eigenspace using PCA [Tur91a]) and the most discriminative linear 
subspace in terms of within and between class scatter (LDA [Bel97]). 
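The similarity computation can be sketched with NumPy as follows. Canonical correlations between two subspaces with orthonormal bases B_1 and B_2 are the singular values of B_1^T B_2. The sketch departs from the text in one respect that should be kept in mind: subspaces are estimated with a plain SVD rather than Probabilistic PCA. The projection onto the constraint subspace is implemented by expressing each basis in constraint-subspace coordinates and re-orthonormalizing with a QR decomposition, which is an assumed but common realization.

```python
import numpy as np

def orthonormal_basis(samples, dim):
    """Basis of the dominant `dim`-dimensional subspace of the samples (one sample per row)."""
    X = np.asarray(samples, dtype=float).T
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :dim]

def canonical_correlations(B1, B2):
    """Cosines of the principal angles between two subspaces, in decreasing order."""
    return np.clip(np.linalg.svd(B1.T @ B2, compute_uv=False), 0.0, 1.0)

def cmsm_similarity(B1, B2, constraint_basis, n_corr=3):
    """Mean of the first n_corr canonical correlations after constraint-subspace projection."""
    def project(B):
        # Coordinates in the constraint subspace, re-orthonormalized.
        Q, _ = np.linalg.qr(constraint_basis.T @ B)
        return Q
    rho = canonical_correlations(project(B1), project(B2))
    return float(np.mean(rho[:n_corr]))
```

The similarity equals one when the projected subspaces share at least three directions and zero when they are mutually orthogonal.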

Computing the constraint subspace. Let {B_i} = B_1, ..., B_N be orthonormal basis 
matrices representing subspaces corresponding to N different classes (cast members, in our 
case). Fukui and Yamaguchi [Fuk03] compute the orthonormal basis matrix B_c 
corresponding to the constraint subspace using PCA from: 

where and Ag are diagonal matrices with diagonal entries, respectively, greater or equal 
than 1 and less than 1. We modify this approach by weighting the contribution of the 
projection matrix B_i by the number of samples used to compute it. This way, a more 
robust estimate is obtained as subspaces computed from smaller amounts of data (i.e. with 
lower Signal-to-Noise Ratio) are de-emphasized: 

As) (b1) (11.8) 


From generic to data-specific discrimination. The problem of estimating B_c lies 
in the fact that we do not know which appearance manifolds belong to the same class and 
which to different classes i.e. {B_i} are unknown. We therefore start from a generic constraint 
subspace B_g, computed offline from a large data corpus. For example, for the evaluation 
reported in Section 11.3 we estimated B_i, i = 1, ..., 100 using the CamFace data set (see 
Appendix C). 

Now, consider the Receiver-Operator Characteristic (ROC) curve of CMSM in Fig¬ 
ure 11.9, also estimated offline. The inherent tradeoff between recall and precision is clear, 
making it impossible to immediately draw class boundaries using the inter-manifold distance 
only. Instead, we propose to exploit the two marked salient points of the curve merely to 
collect data for the construction of the constraint subspace. Starting from an arbitrary 
manifold, the “high recall” point allows us to confidently partition a part of the data into different 
classes. Then, using manifolds in each of the classes we can gather intra-class data using 
the “high precision” point. The collected class information can then be used to compute 
the basis of the data-specific constraint subspace. 

The problem in using the above defined data-specific constraint subspace B_s is that 
it is constructed using only the easiest to classify data. Hence, it cannot be expected to 
discriminate well in difficult cases, corresponding to the points on the ROC curve between 
“high precision” and “high recall”. To solve this problem, we do not substitute the 
data-specific for the generic constraint subspace, but iteratively combine the two based on our 
confidence 0.0 ≤ α ≤ 1.0 in the former: 

    B_c = mix(\alpha, 1 - \alpha, B_s, B_g)    (11.9) 

where α and (1 - α) are mixing weights. We used an eigenspace mixing algorithm similar to 
Hall et al. [Hal00]. The mixing confidence parameter α is determined as follows. Consider 
clustering appearance manifolds using each of the two salient points. The “high precision” 
point will give an overestimate N_h ≥ N of the number of classes N, while the “high recall” 
one an underestimate N_l ≤ N. The closer N_h and N_l are, the more confident we can be 
that the constraint subspace estimate is good. Hence, we compute α as their normalized 
difference (which ensures that the condition 0.0 ≤ α ≤ 1.0 is satisfied): 

    \alpha = 1 - \frac{N_h - N_l}{M - 1}    (11.10) 

where M is the number of appearance manifolds. 
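As a small worked example of the mixing confidence, the sketch below assumes that the normalized difference takes the form α = 1 - (N_h - N_l)/(M - 1), which is the reading consistent with the stated bounds on α; the function name and argument checks are illustrative.

```python
def mixing_confidence(n_high, n_low, n_manifolds):
    """Confidence in the data-specific constraint subspace, in [0, 1].

    n_high: number of classes found at the "high precision" point (overestimate).
    n_low: number of classes found at the "high recall" point (underestimate).
    n_manifolds: total number of appearance manifolds M.
    """
    if n_manifolds < 2:
        raise ValueError("at least two manifolds are required")
    if not 1 <= n_low <= n_high <= n_manifolds:
        raise ValueError("expected 1 <= n_low <= n_high <= n_manifolds")
    return 1.0 - (n_high - n_low) / (n_manifolds - 1)
```

For instance, with 57 manifolds, 20 provisional classes at the "high precision" point and 6 at the "high recall" point, the confidence is 1 - 14/56 = 0.75; when the two estimates agree the data-specific subspace receives full weight.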


11.2.3 The manifold space 

In Section 11.2.2 we described how to preprocess and pairwise compare appearance mani¬ 
folds, optimally exploiting generic information for discriminating between human faces and 
automatically extracted data-specific information. One of the main premises of the proposed 
clustering method is that there is a structure to inter- and intra-personal distances between 
appearance manifolds. To discover and exploit this structure, we consider a manifold space 
- a vector space in which each point represents an appearance manifold. In the proposed 
method, manifold representations in this space are constructed implicitly. 

Figure 11.9: The ROC curve of the Constraint Mutual Subspace Method, estimated offline. 
Shown are two salient points of the curve, corresponding to high precision and high recall. 

We start by computing a symmetric N x N distance matrix D between all pairs of 
appearance manifolds using the method described in the previous section: 

    D(i, j) = CMSM_dist(i, j).    (11.11) 

Note that the entries of D do not obey the triangle inequality, i.e. in general: D(i, j) ≰ 
D(i, k) + D(k, j). For this reason, we next compute the normalized distance matrix \hat{D} using 
Floyd’s algorithm [Cor90]: 

    \forall k: \; \hat{D}(i, j) = \min[D(i, j), \hat{D}(i, k) + \hat{D}(k, j)].    (11.12) 

Finally, we employ a Multi-Dimensional Scaling (MDS) algorithm (similarly to Tenenbaum 
et al. [Ten00]) on \hat{D} to compute the natural embedding of appearance manifolds under the 
derived metric. A typical result of embedding is shown in Figure 11.10. 
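The two steps above can be sketched in NumPy. The first function is the standard Floyd-Warshall shortest-path update, which yields distances that satisfy the triangle inequality; the second is classical (metric) MDS, used here as an assumed concrete choice since the text does not state which MDS variant was employed.

```python
import numpy as np

def shortest_path_distances(D):
    """Floyd-Warshall: replace each distance by the shortest path length between the two points."""
    D_hat = np.array(D, dtype=float)
    for k in range(D_hat.shape[0]):
        # Allow paths that pass through intermediate point k.
        D_hat = np.minimum(D_hat, D_hat[:, k:k + 1] + D_hat[k:k + 1, :])
    return D_hat

def classical_mds(D_hat, n_dims):
    """Embed points so that Euclidean distances approximate the entries of D_hat."""
    n = D_hat.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centring matrix
    B = -0.5 * J @ (D_hat ** 2) @ J           # Gram matrix of the centred embedding
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:n_dims]  # largest eigenvalues first
    scale = np.sqrt(np.clip(evals[order], 0.0, None))
    return evecs[:, order] * scale
```

Each row of the returned matrix is the position of one appearance manifold in the manifold space, so that subsequent clustering can operate on ordinary vectors.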

Anisotropically evolving class boundaries. Consider the previously mentioned clustering 
of appearance manifolds using a particular point on the ROC curve, corresponding to a 
distance threshold $d_t$. It is now easy to see that in the constructed manifold space this 
corresponds to hyper-spherical class boundaries of radius $d_t$ centred at each manifold, see 
Figure 11.11. We now show how to construct anisotropic class boundaries by considering 
the distributions of manifolds. First, (i) simple, isotropic clustering in the manifold space is 
performed using the “high precision” point on the ROC curve, then (ii) a single parametric, 
Gaussian model is fit to each provisional same-class cluster of manifolds, and finally (iii) 
Gaussian models corresponding to the provisional classes are merged in a pair-wise manner, 





Figure 11.10: Manifolds in the manifold space (shown are its first 3 principal components), 
corresponding to preprocessed tracks of faces of the two main characters in the situation 
comedy “Yes, Minister”. Each red dot corresponds to a single appearance manifold of Jim 
Hacker and each black star to a manifold of Sir Humphrey (samples from two typical manifolds 
are shown below the plot). The distribution of manifolds in the space shows a clear structure. 
In particular, note that intra-class manifold distances are often greater than inter-class 
ones. Learning distributions of manifolds provides a much more accurate way of classification. 



Figure 11.11: In the manifold space, the usual form of clustering - where manifolds within a 
certain distance (chosen from the ROC curve) from each other are grouped under the same 
class - corresponds to placing a hyper-spherical kernel at each manifold. 
using a criterion based on the model+data Description Length [Dud00]. The criterion for 
class-cluster merging is explained in detail next. 


Class-cluster merging. In the proposed method, classes are represented by Gaussian 
clusters in the implicitly computed manifold space. Initially, the number of clusters is 
overestimated, each including only those appearance manifolds for which the same-class 
confidence is very high, using the manifold distance corresponding to the “high precision” 
point on the CMSM’s ROC curve. Then, clusters are merged pair-wise. Intuitively, if 
two Gaussian components are quite distant and have little overlap, not much evidence for 
each is needed to decide they represent different classes. The closer they get and the more 
they overlap, the more supporting manifolds are needed to prevent merging. We quantify this 
using what we call the weighted Description Length $DL_w$ and merge tentative classes if 
$\Delta DL_w <$ threshold (we used threshold = -20). 

Let the j-th of C appearance manifolds be $m_j$ and let it consist of n(j) face images. Then 
we compute the log-likelihood of $m_j$ given the Gaussian model $G(m; \Theta)$ in the manifold 
space, weighted by the number of supporting samples n(j): 

$$\frac{\sum_{j=1}^{C} n(j) \log P(m_j | \Theta)}{\sum_{j=1}^{C} n(j)} \qquad (11.13)$$
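
As a small numerical illustration of the weighting in (11.13), the sketch below computes the support-weighted mean of per-manifold log-likelihoods. The function name `weighted_log_likelihood` and the numbers are hypothetical, and the per-manifold log-likelihoods are assumed to have been computed already under the Gaussian model.

```python
def weighted_log_likelihood(log_p, n):
    """Support-weighted mean of per-manifold log-likelihoods, as in (11.13).

    log_p[j] is log P(m_j | Theta); n[j] is the number of face images
    in manifold j.
    """
    if len(log_p) != len(n) or not n:
        raise ValueError("log_p and n must be non-empty and of equal length")
    total_support = sum(n)
    return sum(nj * lp for nj, lp in zip(n, log_p)) / total_support


# Hypothetical values: three manifolds of one tentative class.
log_p = [-2.0, -4.0, -10.0]
n = [50, 30, 20]
weighted = weighted_log_likelihood(log_p, n)
unweighted = sum(log_p) / len(log_p)
```

With these values the poorly explained manifold has the fewest supporting images, so the weighted mean is higher (less negative) than the unweighted one.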


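The weighted Description Length of class data under the same model then becomes: 

$$DL_w(\Theta, \{m_j\}) = \frac{1}{2} N_E \log_2\left( \sum_{j=1}^{C} n(j) \right) - \frac{\sum_{j=1}^{C} n(j) \log P(m_j | \Theta)}{\sum_{j=1}^{C} n(j)} \qquad (11.14)$$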


11.3 Empirical evaluation 

In this section we report the empirical results of evaluating the proposed algorithm on the 
“Open Government” episode of the situation comedy “Yes, Minister”¹. Face detection was 
performed on every 5th out of 42,800 frames, producing 7,965 detections, see Figure 11.12 (a). 
A large number of non-face images is included in this number, see Figure 11.12 (b). Using the 
method for collecting face motion sequences described in Section 11.2.1 and discarding all 
tracks that contain fewer than 10 samples removes most of these. We end up with approximately 
300 appearance manifolds to cluster. The primary and secondary cast consisted of 7 
characters: Sir Hacker, Miss Hacker, Frank, Sir Humphrey, Bernard, a BBC employee and 
the PM’s secretary. 

¹ Available at http://mi.eng.cam.ac.uk/~oa214/academic/ 

Figure 11.12: (a) The “Yes, Minister” data set - every 70th detection is shown for 
compactness. A large number of non-faces is present, typical of which are shown in (b). 

Baseline clustering performance was established using the CMSM-based isotropic method 
with thresholds corresponding to the “high recall” and “high precision” points on the ROC 
curve. Formally, two manifolds are assigned to the same class if the distance D(i,j) between 
them is less than the chosen threshold, see (11.11) and Figure 11.11. Note that the converse 
is not true due to the transitivity of the in-class relation: two manifolds further apart than 
the threshold can still share a class through a chain of intermediate manifolds. 
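
The baseline rule, including its transitive closure, can be sketched with a union-find structure. This is an illustrative sketch only; the function name `threshold_clustering` and the toy distances are hypothetical.

```python
def threshold_clustering(D, d_t):
    """Group items whose pairwise distance is below d_t and close the
    in-class relation transitively. Returns one class label per item."""
    n = len(D)
    parent = list(range(n))

    def find(a):
        # Follow parent links to the representative, halving the path.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i in range(n):
        for j in range(i + 1, n):
            if D[i][j] < d_t:
                parent[find(i)] = find(j)   # merge the two classes

    # Relabel representatives as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


# Items 0 and 2 are 2.0 apart (above the threshold) but are linked via item 1.
D = [
    [0.0, 1.0, 2.0, 9.0],
    [1.0, 0.0, 1.0, 9.0],
    [2.0, 1.0, 0.0, 9.0],
    [9.0, 9.0, 9.0, 0.0],
]
classes = threshold_clustering(D, d_t=1.5)
```

Here items 0 and 2 end up in the same class although their mutual distance exceeds the threshold, which is the sense in which the converse of the rule fails.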

11.3.1 Results 

The cast listing results using the two baseline isotropic algorithms are shown in 
Figure 11.13 (a) and 11.13 (b) - for each class we displayed a 10-image sample from its most 
likely manifold (under the assumption of normal distribution, see Section 11.2.2). As expected, 
the “high precision” method produced a gross overestimate of the number of different 
individuals, e.g. suggesting three classes each for Sir Hacker and Sir Humphrey, and two for 
Bernard. Conversely, the “high recall” method underestimates the true number of classes. 
However, rather more interestingly, while grouping different individuals under the same class, 
this result still contains two classes for Sir Hacker. This is a good illustration of the main 
premise of this chapter, showing that the in-class distance threshold has to be chosen locally 
in the manifold space, if high clustering accuracy is to be achieved. That is what the proposed 
method implicitly does. 

The cast listing obtained with anisotropic clustering is shown in Figure 11.14. For each 
class we displayed 10 images from the highest likelihood sequence. It can be seen that 
our method correctly identified the main cast of the film. No characters are ‘repeated’, unlike 
in both Figure 11.13 (a) and Figure 11.13 (b). This shows that the proposed algorithm for 
growing class boundaries in the manifold space has implicitly learnt to distinguish between 
intrinsic and extrinsic variations between appearance manifolds. Figure 11.15 corroborates 
this conclusion. 

An inspection of the results revealed a particular failure mode of the algorithm, also 
predicted from the theory presented in previous sections. Appearance manifolds corresponding 
to the “BBC employee” were classified to the class dominated by Sir Humphrey, see 
Figure 11.15. The reason for this is a relatively short appearance of this character, producing 
a small number of corresponding face tracks. Consequently, with reference to (11.13) and 
(11.14), not enough evidence was present to maintain them as a separate class. It is 
important to note, however, that qualitatively speaking this is a tradeoff inherent to the 
problem in question. Under an assumption of isotropic noise in image space, any class in the 
film’s cast can generate any possible appearance manifold - it is having enough evidence for 
each class that makes good clustering possible. 

11.4 Summary and conclusions 

A novel clustering algorithm was proposed to automatically determine the cast of a 
feature-length film, without any dataset-specific training information. The coherence of inter- and 
intra-personal dissimilarities between appearance manifolds was exploited by mapping each 
manifold into a single point in the manifold space. Hence, clustering was performed on 
actual appearance manifolds. A mixture-based generative model was used to anisotropically 
grow class boundaries corresponding to different individuals. Preliminary evaluation results 
showed a dramatic improvement over traditional clustering approaches. 


Related publications 

The following publications resulted from the work presented in this chapter: 

• O. Arandjelovic and R. Cipolla. Automatic cast listing in feature-length films with 
anisotropic manifold space. In Proc. IEEE Conference on Computer Vision and Pattern 
Recognition (CVPR), 2:1513-1520, June 2006. [Ara06a] 

[Figure panel with rows labelled Class 01 to Class 07] 

[Figure 11.14 rows: Sir Hacker, Miss Hacker, Humphrey, Secretary, Bernard, Frank] 

Figure 11.14: Anisotropic clustering results - shown are 10 frame sequences from 
appearance manifolds most “representative” of the obtained classes (i.e. the highest likelihood 
ones in the manifold space). Our method has correctly identified 6 out of 7 primary and 
secondary cast members, without suffering from the problems of the two isotropic algorithms 

Figure 11.15: Examples from the “Sir Humphrey” cluster - each horizontal strip is a 10 
frame sample from a single face track. Notice a wide range of appearance changes between 
different tracks: extreme illumination conditions, pose and facial expression variation. The 
bottom-most strip corresponds to an incorrectly clustered track of “BBC employee”. 
Conclusion 


This chapter summarizes the thesis. Firstly, we briefly highlight the main contributions of 
the presented research. We then focus on the two major conceptual and algorithmic novelties 
- the Generic Shape-Illumination Manifold recognition and the Anisotropic Manifold Space 
Clustering method. The significance of the contributions of these two algorithms to the 
face recognition field is considered in more detail. Finally, we discuss the limitations of 
the proposed methods and conclude the chapter with an outline of promising directions for 
future research. 


12.1 Contributions 

Each of the chapters 3 to 11 and appendices A to C was the topic of a particular contribution 
to the field of face recognition. For clarity, these are briefly summarized in Figure 12.1. 

We now describe the two main contributions of the thesis in more detail, namely the 
Generic Shape-Illumination Manifold method of Chapter 9 and the Anisotropic Manifold 
Space clustering algorithm of Chapter 11. 

Generic Shape-Illumination Manifold algorithm 

Starting with Chapter 3 and concluding with Chapter 8 we considered the problem of 
matching video sequences of faces, gradually decreasing restrictions on the data acquisition 
process and recognizing using less training data. This culminated in our proposing the Generic 
Shape-Illumination Manifold algorithm, described in detail in Chapter 9. The algorithm was 
shown to be extremely successful (nearly perfectly recognizing all individuals) on a large 
data set of over 1300 video sequences in realistic imaging conditions. Explicitly, by 
this we mean that recognition is performed in the presence of: (i) large pose variations, (ii) 
extreme illumination conditions (significant non-Lambertian effects), (iii) large illumination 
changes, (iv) uncontrolled head motion pattern, and (v) low video resolution. 

Our algorithm was shown to greatly outperform current state-of-the-art face recognition 
methods in the literature and the best performing commercial software. This is the result 
of the following main novel features: 

1. Combination of data-driven machine learning and prior knowledge-based photometric 
model, 

2. Concept of the Generic Shape-Illumination Manifold as a way of compactly 
representing complex illumination effects across all human faces (illumination robustness), 

3. Video sequence re-illumination algorithm, used to learn the Generic Shape-Illumination 
Manifold (low resolution robustness), and 

4. Automatic selection of the most reliable faces on which to base the recognition decision 
(pose and outlier robustness). 


Chapter 3     Statistical recognition algorithm suitable in the case when training data 
              contains typical appearance variations. 

Chapter 4     Appearance matching by nonlinear manifold unfolding, in the presence 
              of varying pose, noise contamination, face detector outliers and mild 
              illumination changes. 

Chapter 5     Illumination invariant recognition by decision-level fusion of optical and 
              infrared thermal imagery. 

Chapter 6     Illumination invariant recognition by decision-level fusion of raw 
              grayscale and image filter preprocessed visual data. 

Chapter 7     Derivation of a local appearance manifold illumination invariant, 
              exploited in the proposed learning-based nonlinear extension of canonical 
              correlations between subspaces. 

Chapter 8     Person identification system based on combining appearance manifolds 
              with a simple illumination and pose model. 

Chapter 9     Unified framework for data-driven learning and model-based 
              appearance manifold matching in the presence of large pose, illumination and 
              motion pattern variations. 

Chapter 10    Content-based video retrieval based on face recognition; fine-tuned 
              facial registration, accurate background clutter removal and robust 
              distance for partial face occlusion. 

Chapter 11    Automatic identity-based clustering of tracked people in feature-length 
              videos; the manifold space concept. 

Appendix A    Concept of Temporally-Coherent Gaussian mixtures and algorithm for 
              their incremental fitting. 

Appendix B    Probabilistic extension of canonical correlation-based pattern 
              recognition by subspace matching. 

Appendix C    Algorithm for automatic extraction of faces and background removal 
              from cluttered video scenes. 

Figure 12.1: A summary of the contributions of this thesis. 
Anisotropic Manifold Space clustering 

The last two chapters of this thesis considered face recognition in feature-length films, for 
the purpose of content-based retrieval and organization. The Anisotropic Manifold Space 
clustering algorithm was proposed to automatically determine the cast of a feature-length 
film, without any dataset-specific training information. 

Preliminary evaluation results on an episode of the situation comedy “Yes, Minister” 
were vastly superior to those of conventional clustering methods. The power of the proposed 
approach was demonstrated by showing that the correct cast list was produced even using a 
very simple algorithm for normalizing images of faces and comparing individual manifolds. 
The key novelties are: 

1. Clustering over appearance manifolds themselves, which were automatically extracted 
from a continuous video stream, 

2. Concept of the manifold space - a vector space in which each point is an appearance 
manifold, 

3. Iterative algorithm for estimating the optimal discriminative subspace for an 
unlabelled dataset, given the generic discriminative subspace, and 

4. A hierarchical manifold space clustering algorithm based on the proposed appearance 
manifold-driven weighted description length and an underlying generative mixture 
model. 


12.2 Future work 

We conclude the thesis with a discussion on the most promising avenues for further research 
that the work presented has opened up. We will again focus on the two major 
contributions of this work, the Generic Shape-Illumination Manifold method of Chapter 9 and the 
Anisotropic Manifold Space clustering algorithm of Chapter 11. 

Generic Shape-Illumination Manifold algorithm 

The proposed Generic Shape-Illumination Manifold method has immediate potential for 
improvement in the following three areas: 

i Computational efficiency, 

ii Manifold representation, and 
iii Partial occlusion and facial expression changes. 

In Section 9.4.1 we analyzed the computational complexity and the running time of our 
implementation of the algorithm. Empirical results show a computational cost that is 
dominated by a term roughly quadratic in the number of detected faces in a video sequence. 
The most expensive stages of the method are the computation of geodesic distances and 
K-nearest neighbours. While neither of these can be made asymptotically more efficient 
(they correspond to the all-pairs shortest path problem [Cor90]), they can potentially be 
avoided if a different manifold representation is employed. Possible candidates are some of 
the representations used in this thesis: Chapters 3 and 8 showed that Gaussian mixtures 
are suitable for modelling face appearance manifolds, while piece-wise linear models were 
employed in Chapter 3. Either of these would have the benefit of (i) constant storage 
requirements (in our current method, memory needed to represent a manifold is linear in the 
number of faces) and (ii) avoidance of the two most computationally expensive stages in the 
proposed method. Additionally, a novel incremental learning approach for such 
representations is described in Appendix A. 

A more fundamental limitation of the Generic Shape-Illumination Manifold algorithm 
is its sensitivity to partial occlusions and facial expression changes. The former is likely 
an easier problem to tackle. Specifically, several recent methods for partial face occlusion 
detection (e.g. [Lee05, Wil04]) may prove useful in this regard: by detecting the occluded 
region of the face, pose matching and then robust likelihood estimation can be performed 
using only the non-occluded regions by marginalization of the density corresponding to the 
Generic SIM. Extending the algorithm to successfully deal with expression changes is a more 
challenging problem and a worthwhile aim for future research. 

Anisotropic Manifold Space clustering 

The Anisotropic Manifold Space algorithm for clustering of face appearance manifolds can 
be extended in the following directions: 

i More sophisticated appearance matching, 

ii The use of local manifold space projection, and 

iii Discriminative model fitting. 

We now summarize these. 

With the purpose of decreasing the computational load of empirical evaluation, as well as 
demonstrating the power of the introduced Manifold Space clustering, our implementation of 
the algorithm in Chapter 11 used a very simple, linear manifold model with per-frame image 
filtering-based illumination normalization. The limitations of both the linear manifold model 
and the filtering approach to achieving illumination robustness were discussed throughout 
the thesis (e.g. see Chapter 2). A more sophisticated approach, such as one based on the 
proposed Generic Shape-Illumination Manifold of Chapter 9, would be the most immediate 
direction for improvement. 

The proposed Anisotropic Manifold Space algorithm applies MDS to construct an 
embedding of all appearance manifolds in a feature-length video. This has the unappealing 
consequences of (i) rapidly growing computational load and (ii) decreased accuracy of the 
embedding with the increase in the number of manifolds. Both of these limitations can 
be overcome by recognizing that very distant manifolds should not affect mutual clustering 
membership. Hence, in the future we intend to investigate ways of automatically a priori 
partitioning the Manifold Space and unfolding it only a part at a time, i.e. locally. 

Finally, the clustering algorithm in the Manifold Space is based on a generative approach 
with the underlying Gaussian model of class data. Clustering methods better tuned for 
discrimination are likely to prove more suitable for the task at hand. 
In this appendix we address the problem of learning Gaussian Mixture Models (GMMs) 
incrementally. Unlike previous approaches which universally assume that new data comes 
in blocks representable by GMMs which are then merged with the current model estimate, 
our method works for the case when novel data points arrive one-by-one, while requiring 
little additional memory. We keep only two GMMs in memory and no historical data. The 
current fit is updated with the assumption that the number of components is fixed, which is 
increased (or reduced) when enough evidence for a new component is seen. This is deduced 
from the change from the oldest fit of the same complexity, termed the Historical GMM, 
the concept of which is central to our method. The performance of the proposed method 
is demonstrated qualitatively and quantitatively on several synthetic data sets and video 
sequences of faces acquired in realistic imaging conditions. 


A.1 Introduction 

The Gaussian Mixture Model (GMM) is a semi-parametric method for high-dimensional 
density estimation. It is used widely across different research fields, with applications to 
computer vision ranging from object recognition [Dah01], shape [Goo99a] and face 
appearance modelling [Gro00] to colour-based tracking and segmentation [Raj98], to name just a 
few. It is worth emphasizing the key reasons for its practical appeal: (i) its flexibility allows 
for the modelling of complex and nonlinear pattern variations [Gro00], (ii) it is simple and 
efficient in terms of memory, (iii) a principled model complexity selection is possible, and (iv) 
there are algorithms for model parameter estimation that are theoretically guaranteed to 
converge. 

Virtually all previous work with GMMs has concentrated on non-time-critical 
applications, typically in which model fitting (i.e. model parameter estimation) is performed offline, 
or using a relatively small training corpus. On the other hand, the recent trend in computer 
vision is oriented towards real-time applications (for example for human-computer 
interaction and on-the-fly model building) and modelling of increasingly complex patterns which 
inherently involves large amounts of data. In both cases, the usual batch fitting becomes 
impractical and an incremental learning approach is necessary. 

Problem challenges. Incremental learning of GMMs is a surprisingly difficult task. One 
of the main challenges of this problem is the model complexity selection which is required 
to be dynamic by the very nature of the incremental learning framework. Intuitively, if 
all information that is available at any time is the current GMM estimate, a single novel 
point never carries enough information to cause an increase in the number of Gaussian 
components. Another closely related difficulty lies in the order in which new data arrives 
[Hal04]. If successive data points are always poorly correlated, then a large amount of data 
has to be kept in memory if accurate model order update is to be achieved. 

A.1.1 Related previous work 

The most common way of fitting a GMM is using the Expectation-Maximization (EM) 
algorithm [Dem77]. Starting from an estimate of model parameters, soft membership of data 
is computed (the Expectation step) which is then used to update the parameters in the 
maximum likelihood (ML) manner (the Maximization step). This is repeated until convergence, 
which is theoretically guaranteed. In practice, initialization is frequently performed using 
the K-means clustering algorithm [Bis95, Dud00]. 

Incremental approaches. Incremental fitting of GMMs has already been addressed in 
the machine learning literature. Unlike the proposed method, most of the existing methods 
assume that novel data arrives in blocks as opposed to a single datum at a time. Hall et al. 
[Hal00] merge Gaussian components in a pair-wise manner by considering volumes of the 
corresponding hyperellipsoids. A more principled method was recently proposed by Song 
and Wang [Son05] who use the W statistic for covariance and Hotelling’s T² statistic for 
mean equivalence. However, they do not fully exploit the available probabilistic information 
by failing to take into account the evidence for each component at the time of merging. 
Common to both [Hal00] and [Son05] is the failure to make use of the existing model when 
the GMM corresponding to new data is fitted. What this means is that even if some of 
the new data is already explained well by the current model, the EM fitting will try to 
explain it in the context of other novel data, affecting the accuracy of the fit as well as the 
subsequent component merging. The method of Hicks et al. [Hic03] (also see [Hal04]) does 
not suffer from the same drawback. The authors propose to first “concatenate” two GMMs 
and then determine the optimal model order by considering models of all lower complexities 
and choosing the one that gives the largest penalized log-likelihood. A similar approach of 
combining Gaussian components was also described by Vasconcelos and Lippman [Vas98]. 

Model order selection. Broadly speaking, there are three classes of approaches for GMM 
model order selection: (i) EM-based using validation data, (ii) EM-based using model 
validity criteria, and (iii) dynamic algorithms. The first approach involves random partitioning 
of the data to training and validation sets. Model parameters are then iteratively estimated 
from training data and the complexity that maximizes the posterior of the validation set is 
sought. This method is typically less preferred than methods of the other two groups, being 
wasteful both of the data and computation time. The most popular group of methods is 
EM-based and uses the posterior of all data, penalized with model complexity. Amongst 
the most popular are the Minimum Description Length (MDL) [Ris78], Bayesian Information 
(BIC) [Sch78] and Minimum Message Length (MML) [Wal99a] criteria. Finally, there are 
methods which combine the fitting procedure with dynamic model order selection. Briefly, 
Zwolinski and Yang [Zwo01], and Figueiredo and Jain [Fig02] overestimate the 
complexity of the model and reduce it by discarding “improbable” components. Vlassis and Likas 
[Vla99] use weighted sample kurtoses of Gaussian kernels, while Verbeek et al. introduce a 
heuristic greedy approach in which mixture components are added one at a time [Ver03]. 


A.2 Incremental GMM estimation 

A GMM with M components in a D-dimensional embedding space is defined as: 

$$G(x; \theta) = \sum_{i=1}^{M} \alpha_i\, \mathcal{N}(x; \mu_i, C_i) \qquad (A.1)$$

where $\theta = (\{\alpha_i\}, \{\mu_i\}, \{C_i\})$ is the set of model parameters, $\alpha_i$ being the prior of the i-th 
Gaussian component with the mean $\mu_i$ and covariance $C_i$. 

A.2.1 Temporally-coherent GMMs 

We assume temporal coherence on the order in which data points are seen. Let $\{x_t\} = 
\{x_0, \ldots, x_T\}$ be a stream of data, its temporal ordering implied by the subscript. The 
assumption of an underlying Temporally-Coherent GMM (TC-GMM) on $\{x_t\}$ is: 

$$x_0 \sim G(x; \theta)$$
$$x_{t+1} \sim p_S(\| x_{t+1} - x_t \|) \cdot G(x; \theta) \qquad (A.2)$$

where $p_S$ is a unimodal density. Intuitively, while data is distributed according to an 
underlying Gaussian mixture, it is also expected to vary smoothly with time, see Figure A.1. 

A.2.2 Method overview 

The proposed method consists of a three-stage model update each time a new data point 
becomes available, see Figure A.2. At each time step: (i) model parameters are updated 
under the constraint of fixed complexity, (ii) new Gaussian components are postulated by 
model splitting and (iii) components are merged to minimize the expected model description 
length. We keep in memory only two GMMs and no historical data. One is the current 
GMM estimate, while the other is the oldest model of the same complexity after which no 
permanent new cluster creation took place - we term this the Historical GMM. 

Figure A.1: (a) Average distribution of Euclidean distance between temporally consecutive 
faces across video sequences of faces in unconstrained motion. The distribution peaks at a 
low, but greater-than-zero distance, which is typical of Temporally-Coherent GMMs analyzed 
in this appendix. Both too low and too large distances are infrequent, in this case the 
former due to the time gap between the acquisition of consecutive video frames, the latter 
due to the smoothness of face shape and texture. (b) A typical sequence projected to the first 
three principal components estimated from the data, the corresponding MDL EM fit and the 
component centres visualized as images. On average, we found that over 80% of pairs of 
successive faces have the highest likelihood of having been generated by the same Gaussian 
component. 


A.2.3 GMM update for fixed complexity 

    Input:   current GMM estimate G_N, 
             Historical GMM G_N^(h), 
             novel observation x. 
    Output:  updated GMM estimate. 

    1: Fixed-complexity update: 
         update(G_N, x) 
    2: Model splitting: 
         G_M = split-all(G_N, G_N^(h)) 
    3: Pair-wise component merging: 
         for all (i, j) ∈ (1..N, 1..N) 
    4:     Expected description length: 
             [L_1, L_2] = DL{merge(G_M, i, j), split(G_M, i, j)} 
    5:     Complexity update: 
             G_M = L_1 < L_2 ? merge(G_M, i, j) : split(G_M, i, j) 

Figure A.2: A summary of the proposed Incremental TC-GMM algorithm. 

Figure A.3: Fixed complexity update: the mean and the covariance of each Gaussian 
component are updated according to the probability that it generated the novel observation (red 
circle). Old covariances are shown as dashed, the updated ones as solid ellipses corresponding 
to component parameters, while historical data points are displayed as blue dots. 

In the first stage of our algorithm, the current GMM $G(x; \theta)$ is updated under the constraint 
of fixed model complexity, i.e. fixed number of Gaussian components. We start with the 
assumption that the current model parameters are estimated in the ML fashion in a local 
minimum of the EM algorithm: 

$$\alpha_i = \frac{\sum_j p(i|x_j)}{N}, \qquad \mu_i = \frac{\sum_j x_j\, p(i|x_j)}{\sum_j p(i|x_j)}, \qquad C_i = \frac{\sum_j (x_j - \mu_i)(x_j - \mu_i)^T p(i|x_j)}{\sum_j p(i|x_j)} \qquad (A.3)$$
where $p(i|x_j)$ is the probability of the i-th component conditioned on data point $x_j$. 
Similarly, for the updated set of GMM parameters $\theta^*$ it holds: 

$$\alpha_i^* = \frac{\sum_j p^*(i|x_j) + p^*(i|x)}{N+1}, \qquad \mu_i^* = \frac{\sum_j x_j\, p^*(i|x_j) + x\, p^*(i|x)}{\sum_j p^*(i|x_j) + p^*(i|x)} \qquad (A.4)$$

$$C_i^* = \frac{\sum_j (x_j - \mu_i^*)(x_j - \mu_i^*)^T p^*(i|x_j) + (x - \mu_i^*)(x - \mu_i^*)^T p^*(i|x)}{\sum_j p^*(i|x_j) + p^*(i|x)} \qquad (A.5)$$

The key problem is that the probability of each component conditioned on the data 
changes even for historical data $\{x_j\}$. In general, the change in conditional probabilities can 
be arbitrarily large as the novel observation x can lie anywhere in the space. However, 
the expected correlation between temporally close points, governed by the underlying 
TC-GMM model, allows us to make the assumption that component likelihoods do not change 
much with the inclusion of novel information in the model: 

$$p^*(i|x_j) = p(i|x_j) \qquad (A.6)$$

This assumption is further justified by the two stages of our algorithm that follow 
(Sections A.2.4 and A.2.5) - a large change in probabilities $p(i|x_j)$ occurs only when novel data 
is not well explained by the current model. When enough evidence for a new Gaussian 
component is seen, model complexity is increased, while old component parameters switch 
back to their original value. Using (A.6), a simple algebraic manipulation of (A.3)-(A.4), 
omitted for clarity, and writing $\sum_j p(i|x_j) = E_i$ leads to the following: 

$$\alpha_i^* = \frac{E_i + p(i|x)}{N+1}, \qquad \mu_i^* = \frac{\mu_i E_i + x\, p(i|x)}{E_i + p(i|x)} \qquad (A.7)$$

$$C_i^* = \frac{\left( C_i + \mu_i \mu_i^T - \mu_i \mu_i^{*T} - \mu_i^* \mu_i^T + \mu_i^* \mu_i^{*T} \right) E_i + (x - \mu_i^*)(x - \mu_i^*)^T p(i|x)}{E_i + p(i|x)} \qquad (A.8)$$

It can be seen that the update equations depend only on the parameters of the old model and 
the sum of component likelihoods, but no historical data. Therefore the additional memory 
requirements are of the order O(M), where M is the number of Gaussian components. 
Constant-complexity model parameter update is illustrated in Figure A.3. 
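
The update in (A.7)-(A.8) can be sketched as below. This is a minimal illustration under stated assumptions, not the implementation evaluated in this appendix: it assumes NumPy, the names `gaussian_pdf` and `fixed_complexity_update` are hypothetical, and the bracketed term in (A.8) is written in the algebraically equivalent form $C_i + (\mu_i - \mu_i^*)(\mu_i - \mu_i^*)^T$.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Density of a multivariate normal at x."""
    d = len(mean)
    diff = x - mean
    maha = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm


def fixed_complexity_update(alphas, means, covs, E, N, x):
    """One fixed-complexity update for a novel point x.

    E[i] is the stored sum of responsibilities of component i and N the
    number of points seen so far; no historical data is needed.
    """
    M = len(alphas)
    # Responsibilities p(i|x) of the novel point under the current model.
    lik = np.array([alphas[i] * gaussian_pdf(x, means[i], covs[i])
                    for i in range(M)])
    p = lik / lik.sum()

    new_means, new_covs = [], []
    for i in range(M):
        E_new = E[i] + p[i]
        mu_new = (means[i] * E[i] + x * p[i]) / E_new            # (A.7)
        d_mu = means[i] - mu_new
        d_x = x - mu_new
        C_new = ((covs[i] + np.outer(d_mu, d_mu)) * E[i]
                 + np.outer(d_x, d_x) * p[i]) / E_new            # (A.8)
        new_means.append(mu_new)
        new_covs.append(C_new)

    E_out = E + p
    new_alphas = E_out / (N + 1)                                 # (A.7)
    return new_alphas, new_means, new_covs, E_out, N + 1
```

For a single-component model the responsibility of every point is 1, so the incremental result should coincide with the batch ML estimate; this gives a simple sanity check of the update.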


A.2.4 Model splitting 

One of the greatest challenges of incremental GMM learning is dynamic model order
selection. In the second stage of our algorithm, new Gaussian clusters are postulated based
on the parameters of the current model estimate $\Theta$ and the Historical GMM $\Theta^{(h)}$,
which is central to our idea. As, by definition, no permanent model order changes
occurred between the Historical and the current GMMs, they have the same number of
components and, importantly, the 1-1 correspondence between them is known (the current
GMM is merely the Historical GMM that was updated under the constraint of fixed model
complexity). Therefore, for each pair of corresponding components $(\boldsymbol{\mu}_i, \mathbf{C}_i)$ and
$(\boldsymbol{\mu}_i^{(h)}, \mathbf{C}_i^{(h)})$ we compute the 'difference' component, see Figure A.4 (a-c). Writing (A.3)
for the Historical and the current GMMs, and using the assumption in (A.6), the i-th
difference component parameters become:


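$$\alpha_i^{(n)} = \frac{E_i - E_i^{(h)}}{N - N^{(h)}}, \qquad \boldsymbol{\mu}_i^{(n)} = \frac{\boldsymbol{\mu}_i E_i - \boldsymbol{\mu}_i^{(h)} E_i^{(h)}}{E_i - E_i^{(h)}} \tag{A.9}$$

$$\mathbf{C}_i^{(n)} = \frac{\mathbf{C}_i E_i - \left(\mathbf{C}_i^{(h)} + \boldsymbol{\mu}_i^{(h)} \boldsymbol{\mu}_i^{(h)T}\right) E_i^{(h)} + \left(\boldsymbol{\mu}_i^{(h)} \boldsymbol{\mu}_i^T + \boldsymbol{\mu}_i \boldsymbol{\mu}_i^{(h)T} - \boldsymbol{\mu}_i \boldsymbol{\mu}_i^T\right) E_i^{(h)}}{E_i - E_i^{(h)}} - \left(\boldsymbol{\mu}_i - \boldsymbol{\mu}_i^{(n)}\right)\left(\boldsymbol{\mu}_i - \boldsymbol{\mu}_i^{(n)}\right)^T \tag{A.10}$$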
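As a rough sketch of how a difference component could be computed from the two sets of sufficient statistics, the following NumPy function subtracts the Historical contribution from the current one. The function name and argument order are hypothetical, and the covariance expression is written in a compact form derived from the weighted-scatter definition rather than copied from the thesis.

```python
import numpy as np

def difference_component(mu, C, E, N, mu_h, C_h, E_h, N_h):
    """Parameters of the i-th 'difference' component (cf. (A.9)-(A.10)).

    (mu, C, E, N)         : current component statistics and point count
    (mu_h, C_h, E_h, N_h) : the same quantities for the Historical GMM
    """
    E_n = E - E_h                      # likelihood mass contributed by novel data
    alpha_n = E_n / (N - N_h)
    mu_n = (mu * E - mu_h * E_h) / E_n
    d_h = mu_h - mu                    # Historical mean relative to current mean
    d_n = mu_n - mu                    # difference mean relative to current mean
    # Remove the Historical scatter about the current mean, then re-centre
    # what remains on the difference component's own mean.
    C_n = (C * E - (C_h + np.outer(d_h, d_h)) * E_h) / E_n - np.outer(d_n, d_n)
    return alpha_n, mu_n, C_n
```

The division by `E - E_h` becomes ill-conditioned when little novel data has been assigned to the component, so a practical implementation would presumably guard against that case.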


A.2.5 Component merging 

In the proposed method, dynamic model complexity estimation is based on the MDL criterion.
Briefly, MDL assigns to a model a cost related to the amount of information necessary
to encode the model and the data given the model. This cost, known as the description
length $L(\Theta|\{\mathbf{x}_i\})$, is equal to the negative log-likelihood of the data under that model
penalized by the model complexity, measured as the number of free parameters $N_E$:

$$L(\Theta|\{\mathbf{x}_i\}) = \frac{1}{2} N_E \log_2(N) - \log_2 P(\{\mathbf{x}_i\}|\Theta) \tag{A.11}$$

In the case of an M-component GMM with full covariance matrices in $\mathbb{R}^D$ space, free
parameters are (M - 1) for priors, MD for means and MD(D + 1)/2 for covariances:

$$N_E = M - 1 + MD + \frac{MD(D + 1)}{2} \tag{A.12}$$
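For concreteness, the parameter count and the description length can be evaluated as below. This is a minimal sketch; the function names are illustrative and the log-likelihood is assumed to be supplied in bits (base-2 logarithm).

```python
import math

def free_parameters(M, D):
    """Free parameters of an M-component, full-covariance GMM in D dimensions (A.12)."""
    return (M - 1) + M * D + M * D * (D + 1) // 2

def description_length(log2_likelihood, M, D, N):
    """Description length (A.11) given the base-2 log-likelihood of N data points."""
    return 0.5 * free_parameters(M, D) * math.log2(N) - log2_likelihood
```

For example, a single 2D Gaussian has 0 + 2 + 3 = 5 free parameters, and each further component adds 6.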

Figure A.4: Dynamic model order selection: (a) Historical GMM. (b) Current GMM before
the arrival of novel data. (c) New data point (red circle) causes the splitting of a Gaussian
component, resulting in a 3-component mixture. (d) The contribution to the expected model
description length for merging and splitting of the component, as the number of novel data
points is increased.

The problem is that the computation of $P(\{\mathbf{x}_i\}|\Theta)$ requires the historical data $\{\mathbf{x}_i\}$,
which is unavailable. Instead of $P(\{\mathbf{x}_i\}|\Theta)$, we propose to compute the expected likelihood of
the same number of data points and, hence, use the expected description length as the model
order selection criterion. Consider two components with the corresponding multivariate
Gaussian densities $p_1(\mathbf{x}) \sim \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_1, \mathbf{C}_1)$ and $p_2(\mathbf{x}) \sim \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_2, \mathbf{C}_2)$. The expected likelihood
of $N_1$ points drawn from the former and $N_2$ from the latter given model $\alpha_1 p_1(\mathbf{x}) + \alpha_2 p_2(\mathbf{x})$
is:


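$$E[P(\{\mathbf{x}_j\}|\Theta_S)] = \left(\int p_1(\mathbf{x})\left(\alpha_1 p_1(\mathbf{x}) + \alpha_2 p_2(\mathbf{x})\right) d\mathbf{x}\right)^{N_1} \cdot \left(\int p_2(\mathbf{x})\left(\alpha_1 p_1(\mathbf{x}) + \alpha_2 p_2(\mathbf{x})\right) d\mathbf{x}\right)^{N_2} \tag{A.13}$$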


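where integrals of the type $\int p_i(\mathbf{x}) p_j(\mathbf{x}) d\mathbf{x}$ are recognized as related to the Bhattacharyya
distance, and are for Gaussian distributions easily computed as:

$$d_B(p_i, p_j) = \int p_i(\mathbf{x}) p_j(\mathbf{x}) d\mathbf{x} = \frac{\exp(-K/2)}{(2\pi)^{D/2} |\mathbf{C}_i \mathbf{C}_j \mathbf{C}^{-1}|^{1/2}} \tag{A.14}$$

where:

$$\mathbf{C} = \left(\mathbf{C}_i^{-1} + \mathbf{C}_j^{-1}\right)^{-1} \tag{A.15}$$

$$\boldsymbol{\mu} = \mathbf{C}\left(\mathbf{C}_i^{-1} \boldsymbol{\mu}_i + \mathbf{C}_j^{-1} \boldsymbol{\mu}_j\right) \tag{A.16}$$

$$K = \boldsymbol{\mu}_i^T \mathbf{C}_i^{-1} \boldsymbol{\mu}_i + \boldsymbol{\mu}_j^T \mathbf{C}_j^{-1} \boldsymbol{\mu}_j - \boldsymbol{\mu}^T \mathbf{C}^{-1} \boldsymbol{\mu} \tag{A.17}$$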


On the other hand, consider the case when the two components are merged, i.e. replaced
by a single Gaussian component with the corresponding density $p(\mathbf{x})$. Then we compute the
expected likelihood of $N_1$ points drawn from $p_1(\mathbf{x})$ and $N_2$ points drawn from $p_2(\mathbf{x})$, given
model $p(\mathbf{x})$:


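$$E[P(\{\mathbf{x}_j\}|\Theta_M)] = \left(\int p(\mathbf{x}) p_1(\mathbf{x}) d\mathbf{x}\right)^{N_1} \cdot \left(\int p(\mathbf{x}) p_2(\mathbf{x}) d\mathbf{x}\right)^{N_2} \tag{A.18}$$

Substituting the expected evidence and model complexity in (A.11) we get:

$$\Delta E[L] = E[L_S] - E[L_M] = \frac{1}{4} D(D + 1) \log_2(N_1 + N_2) - \log_2 E[P(\{\mathbf{x}_j\}|\Theta_S)] + \log_2 E[P(\{\mathbf{x}_j\}|\Theta_M)] \tag{A.19}$$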


Then the condition for merging is simply $\Delta E[L] > 0$, see Figure A.4 (d). Merging equations
are virtually the same as (A.9) and (A.10) for model splitting, so we do not repeat them.
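The merging test can be illustrated with the sketch below. It is written under stated assumptions rather than as the thesis implementation: the merged component is obtained by plain moment matching (`moment_match` is a hypothetical helper, since the merging equations are not repeated in the text), the complexity coefficient of 1/4 is one reading of (A.19), and all function names are illustrative.

```python
import numpy as np

def gaussian_overlap(mu_i, C_i, mu_j, C_j):
    """Integral of the product of two Gaussian densities, cf. (A.14)-(A.17)."""
    D = len(mu_i)
    Ci_inv, Cj_inv = np.linalg.inv(C_i), np.linalg.inv(C_j)
    C_inv = Ci_inv + Cj_inv                                   # inverse of C in (A.15)
    C = np.linalg.inv(C_inv)
    mu = C @ (Ci_inv @ mu_i + Cj_inv @ mu_j)                  # (A.16)
    K = mu_i @ Ci_inv @ mu_i + mu_j @ Cj_inv @ mu_j - mu @ C_inv @ mu  # (A.17)
    det = np.linalg.det(C_i) * np.linalg.det(C_j) / np.linalg.det(C)
    return np.exp(-K / 2) / ((2 * np.pi) ** (D / 2) * np.sqrt(det))

def moment_match(a1, mu1, C1, a2, mu2, C2):
    """Assumed merge rule: single Gaussian matching the mixture's first two moments."""
    w1, w2 = a1 / (a1 + a2), a2 / (a1 + a2)
    mu = w1 * mu1 + w2 * mu2
    C = (w1 * (C1 + np.outer(mu1 - mu, mu1 - mu))
         + w2 * (C2 + np.outer(mu2 - mu, mu2 - mu)))
    return mu, C

def delta_expected_length(a1, mu1, C1, N1, a2, mu2, C2, N2):
    """Expected description length difference, split minus merged (cf. (A.19))."""
    D = len(mu1)
    I11 = gaussian_overlap(mu1, C1, mu1, C1)
    I12 = gaussian_overlap(mu1, C1, mu2, C2)
    I22 = gaussian_overlap(mu2, C2, mu2, C2)
    log2_split = N1 * np.log2(a1 * I11 + a2 * I12) + N2 * np.log2(a1 * I12 + a2 * I22)
    mu_m, C_m = moment_match(a1, mu1, C1, a2, mu2, C2)
    log2_merged = (N1 * np.log2(gaussian_overlap(mu_m, C_m, mu1, C1))
                   + N2 * np.log2(gaussian_overlap(mu_m, C_m, mu2, C2)))
    return 0.25 * D * (D + 1) * np.log2(N1 + N2) - log2_split + log2_merged
```

A positive return value favours merging. Two coincident components give a positive value because only the complexity term survives, while two well-separated components give a negative one.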


A.3 Empirical evaluation 

The proposed method was evaluated on several synthetic data sets and video sequences of
faces in unconstrained motion, acquired in realistic imaging conditions and localized using
the Viola-Jones face detector [Vio04], see Figure A.1 (b). The two synthetic data sets on
which we illustrate its performance are:

1. 100 points generated from a Gaussian with a diagonal covariance matrix in radial
coordinates: $r \sim \mathcal{N}(\bar{r} = 5, \sigma_r = 0.1)$, $\phi \sim \mathcal{N}(\bar{\phi} = 0, \sigma_\phi = 0.7)$

2. 80 points generated from a uniform distribution in x and a Gaussian noise perturbed
sinusoid in the y coordinate: $x \sim \mathcal{U}(\min x = 0, \max x = 10)$, $y \sim \mathcal{N}(\bar{y} = \sin x, \sigma_y = 0.1)$

Temporal ordering was imposed by starting from the data point with the minimal x coordinate
and then iteratively choosing as the successor the nearest neighbour out of yet unused
points. The initial GMM parameters, the final fitting results and the comparison with the
MDL-EM fitting are shown in Figure A.5. In the case of face motion video sequences, temporal
ordering of data is inherent in the acquisition process. An interesting fitting example
is shown and compared with the MDL-EM batch approach in Figure A.6.
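The two synthetic sets and the greedy ordering described above could be generated as in the following sketch. The function names and the seeds are illustrative, and the quoted spreads (0.1 and 0.7) are assumed here to be standard deviations.

```python
import math
import random

def radial_gaussian(n=100, seed=0):
    """Set 1: Gaussian in polar coordinates (r, phi), returned as Cartesian points."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        r = rng.gauss(5.0, 0.1)      # spread assumed to be a standard deviation
        phi = rng.gauss(0.0, 0.7)
        points.append((r * math.cos(phi), r * math.sin(phi)))
    return points

def noisy_sinusoid(n=80, seed=0):
    """Set 2: x uniform on [0, 10], y a sinusoid perturbed by Gaussian noise."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        points.append((x, rng.gauss(math.sin(x), 0.1)))
    return points

def temporal_order(points):
    """Start at the minimal-x point, then repeatedly move to the nearest unused point."""
    remaining = list(points)
    current = min(remaining, key=lambda p: p[0])
    remaining.remove(current)
    ordered = [current]
    while remaining:
        current = min(remaining, key=lambda p: math.dist(ordered[-1], p))
        remaining.remove(current)
        ordered.append(current)
    return ordered
```

The greedy search is quadratic in the number of points, which is negligible for sets of this size.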

Qualitatively, both in the case of synthetic and face data it can be seen that our algorithm
consistently produces meaningful GMM estimates. Quantitatively, the results are
comparable with the widely accepted EM fitting with the underlying MDL criterion, as
witnessed by the description lengths of the obtained models.

Failure modes. On our data sets two types of phenomena in data sometimes caused
unsatisfactory fitting results. The first, inherently problematic to our algorithm, is when newly
available data is well explained by the Historical GMM. Referring back to Section A.2.4, it
can be seen in (A.9) and (A.10) that this data contributes to the confidence of creating a new
GMM component whereas it should not. The second failure mode was observed when the
assumption of temporal coherence (Section A.2.1) was violated, e.g. when our face detector
failed to detect faces in several consecutive video frames. While this cannot be considered
an inherent fault of our algorithm, it does point out that ensuring temporal coherence of
data is not always a trivial task in practice.

In conclusion, while the results are promising, a more comprehensive evaluation on different
sets of real data is needed to fully understand the behaviour of the proposed method.

A.4 Summary and conclusions 

A novel algorithm for incremental learning of Temporally-Coherent Gaussian mixtures was 
introduced. Promising performance was empirically demonstrated on synthetic data and 
face appearance streams extracted from realistic video, and qualitatively and quantitatively 
compared with the standard EM-based fitting. 


Related publications 

The following publications resulted from the work presented in this appendix: 

• O. Arandjelovic and R. Cipolla. Incremental learning of temporally-coherent Gaussian
mixture models. In Proc. IAPR British Machine Vision Conference (BMVC), 2:pages
759-768, September 2005. [Ara05a]

• O. Arandjelovic and R. Cipolla. Incremental learning of temporally-coherent Gaussian
mixture models. Society of Manufacturing Engineers (SME) Technical Papers,
May 2006. [Ara06d]
(a) Synthetic data set 1


Figure A.5: Synthetic data: (1) data (dots) and the initial model (visualized as ellipses
corresponding to the parameters of the Gaussian components). (2) MDL-EM GMM fit. (3)
Incremental GMM fit. (4) Description length of GMMs fitted using EM and the proposed
incremental algorithm (shown is the description length of the final GMM estimate). Our
method produces qualitatively meaningful results which are also quantitatively comparable with
the best fits obtained using the usual batch method.
Figure A.6: Face motion data: data (dots) and (a) MDL-EM GMM fit. (b) Incremental
GMM fit. (c) Description length of GMMs fitted using EM and the proposed incremental
algorithm (shown is the description length of the final GMM estimate). (d) GMM component
centres visualized as images for the MDL-EM fit (top) and the incremental algorithm
(bottom).
Maximally Probable Mutual Modes


Salvador Dali. Archeological Reminiscence of Millet's Angelus
1933-5, Oil on panel, 31.7 x 39.3 cm
Salvador Dali Museum, St. Petersburg, Florida
Figure B.1: Piece-wise representations of nonlinear manifolds: as a collection of (a)
infinite-extent linear subspaces vs. (b) Gaussian densities.


In this appendix we consider discrimination between linear patches corresponding to local
appearance variations within face image sets. We propose the Maximally Probable Mutual
Modes (MPMM) algorithm, a probabilistic extension of the Mutual Subspace Method
(MSM). Specifically we show how the local manifold illumination invariant introduced in
Section 7.1 naturally leads to a formulation of "common modes" of two face appearance
distributions. Recognition is then performed by finding the most probable mode, which is
shown to be an eigenvalue problem. The effectiveness of the proposed method is demonstrated
empirically on the CamFace dataset.

B.1 Introduction

In Section 7.1 we proposed a piece-wise linear representation of face appearance variation as 
suitable for exploiting the identified local manifold illumination invariant. Recognition by 
comparing nonlinear appearance manifolds was thus reduced to the problem of comparing 
linear patches, which was performed using canonical correlations. Here we address the 
problem of comparing linear patches in more detail and propose a probabilistic extension to 
the concept of canonical correlations. 

B.1.1 Maximally probable mutual modes

In Chapter 7, linear patches used to piece-wise approximate an appearance manifold were
represented by linear subspaces, much like in the Mutual Subspace Method (MSM) of Fukui
and Yamaguchi [Fuk03]. The patches themselves, however, are finite in extent and are hence
better characterized by probability density functions, such as Gaussian densities. This is
the approach we adopt here, see Figure B.1.

Unlike in the case when dealing with subspaces, in general both of the compared distributions
can generate any point in the D-dimensional embedding space. Hence, the concept
of the most-correlated patterns (cf. canonical correlations) from the two classes is not
meaningful in this context. Instead, we are looking for a mode, i.e. a linear direction in the
pattern space, along which both distributions corresponding to the two classes are most
likely to "generate" observations.

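We define the mutual probability $p_m(\mathbf{x})$ to be the product of two densities at $\mathbf{x}$:

$$p_m(\mathbf{x}) = p_1(\mathbf{x}) p_2(\mathbf{x}). \tag{B.1}$$

Generalizing this, the mutual probability of an entire linear mode $\mathbf{v}$ is then:

$$S_\mathbf{v} = \int_{-\infty}^{+\infty} p_1(x\mathbf{v}) p_2(x\mathbf{v}) dx. \tag{B.2}$$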


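Substituting $\frac{1}{(2\pi)^{D/2} |\mathbf{C}_i|^{1/2}} \exp\left(-\frac{1}{2} \mathbf{x}^T \mathbf{C}_i^{-1} \mathbf{x}\right)$ for $p_i(\mathbf{x})$, we obtain:

$$S_\mathbf{v} = \int_{-\infty}^{+\infty} \frac{1}{(2\pi)^{D/2} |\mathbf{C}_1|^{1/2}} \exp\left(-\frac{1}{2} x \mathbf{v}^T \mathbf{C}_1^{-1} \mathbf{v} x\right) \frac{1}{(2\pi)^{D/2} |\mathbf{C}_2|^{1/2}} \exp\left(-\frac{1}{2} x \mathbf{v}^T \mathbf{C}_2^{-1} \mathbf{v} x\right) dx = \tag{B.3}$$

$$= \frac{1}{(2\pi)^{D} |\mathbf{C}_1 \mathbf{C}_2|^{1/2}} \int_{-\infty}^{+\infty} \exp\left(-\frac{1}{2} x^2 \mathbf{v}^T \left(\mathbf{C}_1^{-1} + \mathbf{C}_2^{-1}\right) \mathbf{v}\right) dx. \tag{B.4}$$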


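Noting that the integral is now over a 1D Gaussian distribution (up to a constant):

$$S_\mathbf{v} = \frac{1}{(2\pi)^{D} |\mathbf{C}_1 \mathbf{C}_2|^{1/2}} (2\pi)^{1/2} \left[\mathbf{v}^T \left(\mathbf{C}_1^{-1} + \mathbf{C}_2^{-1}\right) \mathbf{v}\right]^{-1/2} = \tag{B.5}$$

$$= (2\pi)^{1/2 - D} |\mathbf{C}_1 \mathbf{C}_2|^{-1/2} \left[\mathbf{v}^T \left(\mathbf{C}_1^{-1} + \mathbf{C}_2^{-1}\right) \mathbf{v}\right]^{-1/2}. \tag{B.6}$$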


The expression above favours directions in which both densities have large variances, i.e.
in which the signal-to-noise ratio is the highest, as one may intuitively expect, see Figure B.4.

Figure B.2: Conceptual drawing of the Maximally Probable Mutual Mode concept for 2D
Gaussian densities.

The mode that maximizes the mutual probability $S_\mathbf{v}$ can be found by considering the
eigenvalue decomposition of $\mathbf{C}_1^{-1} + \mathbf{C}_2^{-1}$. Writing:

$$\mathbf{C}_1^{-1} + \mathbf{C}_2^{-1} = \sum_{i=1}^{D} \lambda_i \mathbf{u}_i \mathbf{u}_i^T, \tag{B.7}$$

where $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_D > 0$ and

$$\mathbf{u}_i \cdot \mathbf{u}_j = 0,\ i \neq j, \qquad \mathbf{u}_i \cdot \mathbf{u}_i = 1, \tag{B.8}$$

and since $\{\mathbf{u}_i\}$ span $\mathbb{R}^D$:

$$\mathbf{v} = \sum_{i=1}^{D} a_i \mathbf{u}_i, \tag{B.9}$$

it is then easy to show that the maximal value of (B.6) is:

$$\max_\mathbf{v} S_\mathbf{v} = (2\pi)^{1/2 - D} |\mathbf{C}_1 \mathbf{C}_2|^{-1/2} \lambda_D^{-1/2} \tag{B.10}$$

This defines the class similarity score $\nu$. It is achieved for $a_{1,\ldots,D-1} = 0$, $a_D = 1$ or $\mathbf{v} = \mathbf{u}_D$,
i.e. the direction of the eigenvector corresponding to the smallest eigenvalue of $\mathbf{C}_1^{-1} + \mathbf{C}_2^{-1}$.
A visualization of the most probable mode between two face sets of Figure B.3 is shown in
Figure B.4.
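A compact NumPy sketch of this computation is given below. It assumes both covariance matrices are well conditioned (the singular case is treated in Section B.1.2), works with the logarithm of the score to avoid underflow for large D, and uses the hypothetical function name `mutual_mode`.

```python
import numpy as np

def mutual_mode(C1, C2):
    """Most probable mutual mode and log of the similarity score, cf. (B.6)-(B.10).

    C1, C2 : (D, D) covariance matrices of two zero-mean Gaussian densities,
             assumed symmetric positive definite.
    """
    D = C1.shape[0]
    A = np.linalg.inv(C1) + np.linalg.inv(C2)
    eigvals, eigvecs = np.linalg.eigh(A)      # eigh returns eigenvalues in ascending order
    lam_min, v = eigvals[0], eigvecs[:, 0]    # smallest eigenvalue and its eigenvector
    log_det = np.linalg.slogdet(C1)[1] + np.linalg.slogdet(C2)[1]
    log_score = (0.5 - D) * np.log(2 * np.pi) - 0.5 * log_det - 0.5 * np.log(lam_min)
    return v, log_score
```

For two axis-aligned 2D densities that are both elongated along the first axis, the returned mode is that axis (up to sign), in line with the observation that directions of large variance in both densities are favoured.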
(a) [P1, I1] - [P1, I2] (b) [P1, I1] - [P2, I1]


Figure B.4: The maximally probable mutual mode, shown as an image, when two compared
face sets belong to the (a) same and (b) different individuals (also see Figure B.3).


B.1.2 Numerical and implementation issues 


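The expression for the similarity score $\nu = \max_\mathbf{v} S_\mathbf{v}$ in (B.10) involves the computation of
$|\mathbf{C}_1 \mathbf{C}_2|^{-1/2}$. This is problematic as $\mathbf{C}_1 \mathbf{C}_2$ may be a singular, or a nearly singular, matrix
(e.g. because the number of face images is much lower than the image space dimensionality
D).

We solve this problem by assuming that the dimensionality of the principal linear subspaces
corresponding to $\mathbf{C}_1$ and $\mathbf{C}_2$ is $M \ll D$, and that data is perturbed by isotropic
Gaussian noise. If $\lambda_1^{(i)} \geq \ldots \geq \lambda_D^{(i)}$ are the eigenvalues of $\mathbf{C}_i$:

$$\forall j > M.\ \lambda_j^{(i)} = \lambda_{M+1}^{(i)}. \tag{B.11}$$

Then, writing

$$|\mathbf{C}_i| = \prod_{j=1}^{D} \lambda_j^{(i)} \tag{B.12}$$

we get:

$$\nu = (2\pi)^{1/2 - D} |\mathbf{C}_1 \mathbf{C}_2|^{-1/2} \lambda_D^{-1/2} = \tag{B.13}$$

$$= \text{const} \times \left(\lambda_D \prod_{j=1}^{M} \lambda_j^{(1)} \lambda_j^{(2)}\right)^{-1/2} \tag{B.14}$$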

(B.14) 
Table B.1: Recognition performance statistics (%).

Method                       MPMM    MSM
Recognition rate, average    92.0    58.3
Recognition rate, std         7.8    24.3


B.2 Experimental evaluation 

We demonstrate the superiority of the Maximally Probable Mutual Modes to the Mutual 
Subspace Method [Fuk03] on the CamFace data set using the Manifold Principal Angles 
algorithm of Chapter 7. With the purpose of focusing on the underlying comparison of 
linear subspaces we omit the contribution of global appearance in the overall manifold 
similarity score by setting a = 0 in (7.8). 

A summary of the results is shown in Table B.1 with the Receiver-Operator Characteristics
(ROC) curve for the MPMM method in Figure B.5. The proposed method achieved
a significantly higher average recognition rate than the original MSM algorithm.


B.3 Summary and conclusions 

We described a probabilistic extension to the concept of canonical correlations which has
been widely used in the pattern recognition literature. The resulting method was shown
to be suitable for matching local appearance variations between face sets, exploiting a
manifold illumination invariant.


Related publications

The following publications resulted from the work presented in this appendix:

• O. Arandjelovic and R. Cipolla. Face set classification using maximally probable mutual
modes. In Proc. IEEE International Conference on Pattern Recognition (ICPR),
pages 511-514, August 2006. [Ara06c]
The CamFace data set


Camille Pissarro. Boulevard Montmartre
1897, Oil on canvas, 74 x 92.8 cm


Table C.1: The proportion of the two genders in the CamFace dataset.

Gender    Male    Female
Number    67      33


Figure C.1: The distribution of people's ages across the CamFace data set.

The University of Cambridge Face database (CamFace database) is a collection of video
sequences of largely unconstrained, random head movement in different illumination
conditions, acquired for the purpose of developing and evaluating face recognition algorithms.
This appendix describes (i) the database and its acquisition, and (ii) a novel method for
automatic extraction of face images from videos of head motion in a cluttered environment,
suitable as a preprocessing step to recognition algorithms. The database and the
preprocessing described are used extensively in this thesis.


C.1 Description

The CamFace data set is a database of face motion video sequences acquired in the
Department of Engineering, University of Cambridge. It contains 100 individuals of varying age,
ethnicity and gender, see Figure C.1 and Table C.1.

For each person in the database we collected 14 video sequences of the person in
quasi-random motion. We used 7 different illumination configurations and acquired 2 sequences
with each for a given person, see Figure C.2. The individuals were instructed to approach the
camera and move freely, with the loosely enforced constraint of being able to see their eyes
on the screen providing visual feedback in front of them, see Figure C.3 (a). Most sequences
contain significant yaw and pitch variation, some translatory motion and negligible roll. Mild
facial expression changes are present in some sequences (e.g. when the user was smiling or
talking to the person supervising the acquisition), see Figure C.4.




Figure C.2: (a) Illuminations 1-7 from the CamFace data set. (b) Five different individuals 
in the illumination setting number 6. In spite of the same spatial arrangement of light 
sources, their effect on the appearance of faces changes significantly due to variations in 
people’s heights and the ad lib chosen position relative to the camera. 



Figure C.3: (a) Visual feedback displayed to the user during data acquisition. (b) The
pin-hole camera used to collect the CamFace data set.

Figure C.4: A 100 frame, 10 fps video sequence typical for the CamFace data set. The user 
positions himself ad lib and performs quasi-random head motion. Although instructed to 
keep head pose variations within the range in which the eyes are clearly visible, note that a 
significant number of poses does not meet this requirement. 

Table C.2: An overview of CamFace data set statistics.

          Individuals    Illuminations    Sequences per illumination    Frames per
                                          per person                    second (fps)
Number    100            7                2                             10

Acquisition hardware. Video sequences were acquired using a simple pin-hole camera
with automatic gain control, mounted at 1.2 m above the ground and pointing upwards at
30 degrees to the horizontal, see Figure C.3. Data was acquired at 10 fps, giving 100 frames
for each 10 s sequence, in 320 by 240 pixel resolution, see Figure C.4 for an example and
Table C.2 for a summary. On average, the face occupies an area of 60 by 60 pixels.


C.2 Automatic extraction of faces 

We used the Viola-Jones cascaded detector [Vio04] in order to localize faces in cluttered
images. Figure C.4 shows examples of input frames, Figure C.5 (b) an example of a correctly
detected face and Figure C.6 all detected faces in a typical sequence. A histogram of the
number of detections we get across the entire data set is shown in Figure C.7.

Rejection of false positives. The face detector achieves high true positive rates for our
database. A larger problem is caused by false alarms, even a small number of which can
affect the density estimates. We use a coarse skin colour classifier to reject many of the false
detections. The classifier is based on 3-dimensional colour histograms built for two classes:
skin and non-skin pixels [Jon99]. A pixel can then be classified by applying the likelihood
ratio test. We apply this classifier and reject detections in which too few (< 60%) or too
many (> 99%) pixels are labelled as skin. This step removes the vast majority of non-faces
as well as faces with grossly incorrect scales, see Figure C.8 for examples of successfully
removed false positives.

Figure C.5: Illustration of the described preprocessing pipeline. (a) Original input frame
with resolution of 320 x 240 pixels. (b) Face detection with average bounding box size of
75 x 75 pixels. (c) Resizing to the uniform scale of 40 x 40 pixels. (d) Background removal
and feathering. (e) The final image after histogram equalization.

Figure C.6: Per-frame face detector output from a typical 100 frame, 10 fps video sequence
(also see Figure C.4). The detector is robust to a rather large range of pose changes.

Figure C.7: A histogram of the number of face detections per sequence across the CamFace
data set.
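The rejection rule described above can be sketched as follows. Only the 60% and 99% thresholds come from the text; the bin count `BINS`, the add-one smoothing, the unit likelihood-ratio threshold and all function names are assumptions made for illustration.

```python
from collections import Counter

BINS = 8  # hypothetical number of histogram bins per colour channel

def quantize(pixel, bins=BINS):
    """Map an 8-bit (r, g, b) pixel to its cell in the 3D colour histogram."""
    step = 256 // bins
    return tuple(channel // step for channel in pixel)

def build_histogram(pixels, bins=BINS):
    """Count training pixels per histogram cell; returns (counts, total)."""
    counts = Counter(quantize(p, bins) for p in pixels)
    return counts, sum(counts.values())

def likelihood(pixel, hist, bins=BINS):
    """Smoothed probability of the pixel's cell (add-one smoothing is an assumption)."""
    counts, total = hist
    return (counts[quantize(pixel, bins)] + 1) / (total + bins ** 3)

def is_skin(pixel, skin_hist, nonskin_hist, threshold=1.0):
    """Likelihood ratio test between the skin and non-skin colour models."""
    return likelihood(pixel, skin_hist) / likelihood(pixel, nonskin_hist) > threshold

def accept_detection(pixels, skin_hist, nonskin_hist, low=0.60, high=0.99):
    """Keep a detection only if its skin fraction lies between the two thresholds."""
    skin_fraction = sum(is_skin(p, skin_hist, nonskin_hist) for p in pixels) / len(pixels)
    return low <= skin_fraction <= high
```

The upper threshold discards detections that are almost entirely skin, which is consistent with the remark that the step also removes faces detected at a grossly incorrect scale.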


Background clutter removal. The bounding box of a detected face typically contains
a portion of the background. The removal of the background is beneficial because it can
contain significant clutter and also because of the danger of learning to discriminate based
on the background, rather than face appearance. This is achieved by set-specific skin colour
segmentation: given a set of images from the same subject, we construct colour histograms
for that subject's face pixels and for the near-face background pixels in that set. Note that
the classifier here is tuned for the given subject and the given background environment,
and thus is more "refined" than the coarse classifier used to remove false positives. The
face pixels are collected by taking the central portion of the few most symmetric images in
the set (assumed to correspond to frontal face images); the background pixels are collected
from the 10 pixel-wide strip around the face bounding box provided by the face detector,
see Figure C.10. After classifying each pixel within the bounding box independently, we
smooth the result using a simple 2-pass algorithm that enforces the connectivity constraint
on the face and boundary regions, see Figure C.5 (d). A summary of the cascade in its
entirety is shown in Figure C.11.

Figure C.9: The response of our vertical symmetry-based measure of the "frontality" of a
face, used to select the most reliable faces for extraction of background and foreground colour
models. Also see Figures C.10 and C.11.


Related publications 

The following publications contain portions of work presented in this appendix: 

• O. Arandjelovic, G. Shakhnarovich, J. Fisher, R. Cipolla, and T. Darrell. Face recognition
with image sets using manifold density divergence. In Proc. IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), 1:pages 581-588, June 2005.
[Ara05b]

• O. Arandjelovic and R. Cipolla. An information-theoretic approach to face recognition
from face motion manifolds. Image and Vision Computing (special issue on Face
Processing in Video Sequences), 24(6):pages 639-647, June 2006. [Ara06e]
Figure C.10: (a) Areas used to sample face and background colours, and the corresponding
(b) face and (c) background histograms in RGB space used for ML skin-colour detection.
Larger blobs correspond to higher densities and are colour-coded.

Figure C.11: A schematic representation of the face localization and normalization cascade.